Comparison of 3D Sensors for Automating Bolt-Tightening Operations in the Automotive Industry

Machine vision systems are widely used in assembly lines to give robots the sensing abilities they need to handle dynamic environments. This paper presents a comparison of 3D sensors to evaluate which one is best suited for a machine vision system for robotic fastening operations within an automotive assembly line. The perception system is necessary for taking into account the position uncertainty that arises from the vehicles being transported in an aerial conveyor. Three sensors with different working principles were compared, namely laser triangulation (SICK TriSpector1030), structured light with sequential stripe patterns (Photoneo PhoXi S) and structured light with infrared speckle pattern (Asus Xtion Pro Live). The accuracy of the sensors was measured by computing the root mean square error (RMSE) of the point cloud registrations between their scans and two types of reference point clouds, namely, CAD files and 3D sensor scans. Overall, the RMSE was lower when using sensor scans, with the SICK TriSpector1030 achieving the best results (0.25 mm ± 0.03 mm), the Photoneo PhoXi S having the intermediate performance (0.49 mm ± 0.14 mm) and the Asus Xtion Pro Live obtaining the highest RMSE (1.01 mm ± 0.11 mm). Considering the use case requirements, the final machine vision system relied on the SICK TriSpector1030 sensor and was integrated with a collaborative robot, which was successfully deployed in a vehicle assembly line, achieving 94% success in 53,400 screwing operations.


Introduction
Smart manufacturing began when the vision of Industry 4.0 expanded the efficiency and flexibility expectations of automation by emphasizing process digitalization. Now, over a decade later, with digital transformation under way across industries and society facing shared challenges, a new paradigm shift is changing what is expected of industries: to go beyond efficiency and accept their role as sustainable service providers for society.
The worldwide automotive industry is already progressing toward the Industry 5.0 transformation. It is, in fact, one of the leading sectors in terms of the adoption of new technologies, seeking to combine human expertise with the capabilities of intelligent machines to improve manufacturing processes and empower workers, increasing product personalization while retaining or even enhancing product quality. This desire for rapid adoption stems from the automakers' and suppliers' needs to quickly respond to changes in market demand, resulting in increasingly customized products with shorter lead times.
To achieve these goals, collaborative robotic solutions are one of the most important technologies of Industry 4.0 and 5.0. They allow for the creation of innovative solutions to automate manufacturing processes and provide flexibility to the production system. Using machine vision, these robots can achieve new levels of autonomy by being able to [...] edge precision, spatial resolution (number of measurements per cm² at a given distance), radius reconstruction accuracy and surface continuity. For visually displaying the difference between the sensor data and the surface scanning models (plane, sphere, cube, box, cylinder and dodecahedron), Chen et al. [17] also relied on color maps to complement the RMSE analysis.
Besides comparing the sensors' performance, this paper also provides the success rate of an automated bolt-tightening machine that relies on one of the sensors under analysis. Unlike other approaches to bolt tightening that perform perception of the bolt itself [18][19][20][21], the system deployed relied on the 3D perception of the structure of the van to which the bolts were attached. This approach allowed for unambiguous 6 degrees of freedom (DoF) pose estimation, as the van's structure has a surface with unique geometry and a higher number of points compared to the bolts. Furthermore, the proposed approach does not have the problem of ambiguity in the 6-DoF pose of the bolts due to their symmetry axes.
In the assembly line where the automated tightening machine was deployed, a van is transported by an aerial conveyor throughout different workstations, and when it arrives at the workstation under analysis, the operator picks an axle damper from a bin, goes underneath the vehicle, and installs it in the van rear undercarriage, which has two attachment points. Later on, the operator places the respective two bolts and gives them a few turns. Next, using a pantograph, the operator guides two electric screwdrivers for fastening both bolts at once. For each van, this process is performed for the left and right rear wheels. This task is repetitive and non-ergonomic, which may cause musculoskeletal injuries to the human operator. As such, the goal is to develop a collaborative robotic solution, where the operator is responsible for placing the axle damper and the bolts, and then the robot performs the fastening operations using two electric screwdrivers. For ensuring reliable operation, the collaborative robotic system must be capable of locating the axle damper attachment pose and fastening the two bolts using the two electric screwdrivers. Since the van 6-DoF pose at this specific workstation varies due to the mechanical tolerances of the aerial conveyor and weight of the vehicle, the robot needs to perceive the pose of the axle damper attachment structure in order to successfully perform the bolt-tightening operations.
The remainder of this paper is organized as follows. Section 2 presents some fundamental concepts regarding 3D sensing technology. Section 3 describes in detail the comparison of the 3D sensors within the described use case, including the methodology used for this comparison, the results obtained with each sensor and the respective discussion. Section 4 describes the integration of the sensor on the collaborative robotic solution and the assessment of its performance. Section 5 presents the conclusions of this study.

3D Sensing Technologies
Three-dimensional sensors can be classified as active or passive according to their imaging technology [22]. Passive sensors, such as stereo cameras, rely on the light reflected from external sources for observing the environment, while active sensors rely on their own source of radiation for probing the environment, making them more robust for scanning textureless surfaces and dark environments. Examples of active sensing technologies include but are not limited to laser triangulation, structured light and time of flight.
One of the most reliable and accurate optical sensing technologies is laser triangulation (point or line). The resulting 3D point cloud data are computed by interpreting the deformation of the laser line when observed from the camera perspective. Coupled with a known movement of the object on a conveyor or of the sensor mounted on a track or robotic arm, several 3D scan profiles can be merged together to form a 3D point cloud of the surface to be analyzed. These sensors are usually small and have a high acquisition rate (1 kHz). The most significant disadvantage of this technology is the requirement to generate a known movement, either of the object or of the sensor itself. Despite this, 3D laser triangulation is often chosen because it provides greater robustness not only in terms of variations in ambient light but also in terms of the materials and color of the objects of interest, making it desirable in many industrial applications [23].
Structured light sensors consist of a light projector and one or more cameras [24]. The light source projects a set of known patterns into the environment, which are distorted when they hit the surface of objects. Depending on the pattern used, one or more images need to be captured. For example, a speckle pattern is static and needs only one image capture for generating sparse depth information, which can be coupled with measurement interpolation to increase the point cloud density. On the other hand, sequential stripe patterns can achieve dense surface measurements with much higher point cloud density but require several image captures with a static environment, one for each pattern with decreasing stripe thickness. The 3D sensors based on structured light are among the most used in the industry for 3D perception and inspection tasks, given their high accuracy, high-density point clouds and robustness for scanning textureless surfaces and operating in a wide range of light conditions.
Time-of-flight (ToF) cameras rely on infrared light pulses for probing the environment. They estimate the distance to objects by measuring the time difference between the pulse emission and the detection of the reflected signal [25]. The interest in these sensors has been increasing mainly due to their applicability in autonomous vehicles. Typically, these sensors have less accurate 3D data when compared to structured light sensors and generate many more shadow/veiling points on the border of objects, but they have a much higher data acquisition rate.
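As a minimal illustration of the time-of-flight principle described above, the sketch below converts a round-trip pulse time into a distance using d = c·Δt/2. The function name and the sample value are hypothetical and are not taken from any sensor SDK.

```python
# Speed of light in vacuum, in metres per second (exact SI value).
SPEED_OF_LIGHT_M_S = 299_792_458.0


def tof_distance_m(round_trip_time_s: float) -> float:
    """Distance to a surface from the round-trip time of a light pulse.

    The pulse travels to the object and back, hence the division by two.
    """
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    return SPEED_OF_LIGHT_M_S * round_trip_time_s / 2.0


if __name__ == "__main__":
    # Hypothetical example: a 10 ns round trip corresponds to about 1.5 m.
    print(f"{tof_distance_m(10e-9):.3f} m")
```

The example also shows why timing resolution matters for these sensors: at this scale, a timing error of a few picoseconds already corresponds to a depth error on the order of a millimetre.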
Stereo vision systems can perform 3D reconstruction of the scene by calculating the correspondence between pixels of two different images taken by cameras in different perspectives using triangulation. Since the accuracy of 3D measurements depends heavily on identifying and correctly matching points between images from different cameras, some stereo vision systems project a pattern into the environment to refine point matching. This approach significantly improves the measurement accuracy in low-texture environments. However, the consistency of the measurements is not as reliable as the previously mentioned technologies. Moreover, passive stereo systems have higher measurement errors when operating in low-light environments. These sensors are less used in industrial applications because of these limitations [26].

Selection of the 3D Sensors
Different types of sensors can be used depending on the requirements of the machine vision system. The goal of the use case under analysis is to determine with high accuracy (<3 mm) the 6-DoF pose of the axle damper attachment structure, which does not have texture and is painted with partially reflective white color. Moreover, the sensor would be operating without controlled light conditions and needs to be compact to be mounted on the end effector of a robotic arm.
With this use case in mind, three sensors based on active depth sensing technologies were chosen, namely, the SICK TriSpector1030 (laser line triangulation), the Photoneo PhoXi S (visible stripe pattern structured light) and the Asus Xtion Pro Live (IR speckle pattern structured light). They are presented in Figure 1, and their technical specifications are presented in Table 1.
From these three sensors, only the 3D point cloud was evaluated. The 2D image provided by the Photoneo and Asus was not considered since the goal is to estimate the 6-DoF pose of the target object. Moreover, the SICK does not provide a 2D image; it only provides 3D data.
In relation to embedded processing, only the SICK has this feature, in which the sensor is programmed using the SOPAS Engineering Tool software. The Photoneo and Asus need an external PC to process the sensor data. In order to have a fair comparison between these sensors, the embedded capabilities of the SICK sensor were not used, and the evaluation software was run on an external PC for processing the 3D data retrieved from each of the three sensors.
Other sensors were considered, such as the Intel Realsense D435 (active stereo sensor), but from our preliminary tests, it had slightly higher surface measurement error when compared to the Asus Xtion Pro Live. From the ToF sensor technology field, we also analyzed the Azure Kinect, but it had a lot of shadow/veiling points, which complicated the 3D data segmentation stage of the axle damper (even though this issue could be mitigated with the statistical outlier removal filter from the point cloud library (PCL) [27]). From the passive stereo range of sensors, the Nerian Scarlet and the Carnegie Robotics Multisense could be possible candidates, but since this passive sensing technology would not perform well in the non-textured surface of the van, we opted not to include them in the comparison.
In the end, we chose to keep the sensor comparison concise and limit the scope to the two main candidates (SICK TriSpector1030 and Photoneo PhoXi S), which were built for industrial use cases, while also including a lower cost sensor from the nonindustrial/consumer market to have an entry-level sensor in the comparison results. In the future, we might consider the Ensenso N35 since it is an industrial rated sensor and, from its specifications, it would be expected to perform between the Photoneo PhoXi S and the Asus Xtion Pro Live.

Methodology for Evaluating the 3D Sensors
To compare the performance of the three sensors, point clouds of the axle damper attachment structure with and without the axle damper were acquired in a testing environment similar to the real assembly line, namely, with a real van and axle damper samples that could be manually placed and removed (as shown in Figure 2).
The structured light sensors were mounted on a tripod since they need to be static during the scanning procedure. The same does not apply to the laser line triangulation sensor, which was installed on the end effector of a robotic arm (Doosan H2017) to capture the 3D profiles of the region of interest (ROI). Figure 3 shows the location of the sensors with respect to the van undercarriage.
For creating the sensor evaluation dataset, 12 scans were captured for each side of the van with each sensor. For performing the comparison, the RMSE metric presented in Equation (1) was used, which computes the root mean square of the distances between the sensor data points (S) and their respective closest points in the reference point cloud (R).
It is important to mention that in this testing workstation, the van was fixed to a rigid structure and not to an aerial conveyor (as in the assembly line workstation). Therefore, to simulate the aerial conveyor deviations on the van positioning, the sensor poses were slightly changed manually before performing 3D scanning.
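For reference, the RMSE between a sensor point cloud S and a reference point cloud R, as described above, can be written in its standard form. The symbols N (number of sensor points), s_i (a sensor point) and r_i (the closest reference point to s_i) are notation assumed for this sketch:

```latex
\mathrm{RMSE}(S, R) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\lVert s_i - r_i \right\rVert^{2}},
\qquad
r_i = \arg\min_{r \in R} \left\lVert s_i - r \right\rVert,
\quad s_i \in S
```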
Two reference point clouds were used as the ground truth, one extracted from the scans without the axle dampers and another from the CAD model of the van (Figure 4). The CAD model was provided by the car manufacturing company, and an ROI was applied to extract the required surface section.
To create the reference point cloud from the 3D sensor scans, the point cloud without the axle dampers was filtered and segmented with the following steps:

1. Scan the van without the axle damper.
2. Crop the point cloud if necessary (only applicable for the Asus Xtion Pro Live sensor due to its higher scan volume).
3. Segment the point cloud into clusters using the region growing segmentation algorithm [28].
4. Extract the ROI where the axle damper will be placed (considering the acquired point clouds, this corresponds to the biggest cluster).
The region growing segmentation algorithm starts by sorting the points by their curvature and then selects as the first seed the point with the lowest curvature. Then, it keeps expanding the current cluster seeds by adding neighboring points that have an angle between the current seed normal and the neighboring point normal below a given threshold. After no more points can be added to the current cluster, a new cluster is initialized with a seed point that has the lowest curvature from the points that do not yet belong to a cluster. The algorithm for growing and creating new clusters keeps repeating until all the points are associated with a labeled cluster.
This segmentation algorithm was selected because the van support structure has a locally smooth surface with transition zones to the axle damper surfaces with large curvature differences. Moreover, the support structure has a surface area that is much larger than that of the axle damper, allowing the segmentation selector to pick the cluster with the largest number of points.
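The region growing procedure described above can be sketched in plain Python as follows. This is a simplified illustration, not the PCL implementation: it assumes that normals and curvatures have already been estimated, uses a brute-force radius search, and treats every accepted point as a new seed (PCL additionally applies a curvature threshold when promoting points to seeds). The function names and parameters are hypothetical.

```python
import math
from collections import deque


def region_growing(points, normals, curvatures, radius, max_angle_deg):
    """Label points by smoothness-based region growing.

    points, normals: lists of (x, y, z) tuples; normals must be unit length.
    curvatures: one float per point. Returns one cluster label per point.
    """
    n = len(points)
    cos_threshold = math.cos(math.radians(max_angle_deg))
    labels = [-1] * n  # -1 means "not yet assigned to a cluster"

    def neighbors(i):
        # Brute-force radius search; a k-d tree would replace this in practice.
        return [j for j in range(n)
                if j != i and math.dist(points[i], points[j]) <= radius]

    # Candidate seeds are visited from the lowest to the highest curvature.
    order = sorted(range(n), key=lambda i: curvatures[i])
    next_label = 0
    for start in order:
        if labels[start] != -1:
            continue  # already absorbed by an earlier cluster
        labels[start] = next_label
        seeds = deque([start])
        while seeds:
            seed = seeds.popleft()
            for j in neighbors(seed):
                if labels[j] != -1:
                    continue
                # abs() makes the test insensitive to flipped normal directions.
                cos_angle = abs(sum(a * b for a, b in zip(normals[seed], normals[j])))
                if cos_angle >= cos_threshold:
                    labels[j] = next_label
                    seeds.append(j)
        next_label += 1
    return labels


def largest_cluster(labels):
    """Label of the cluster with the most points (the ROI selector in the text)."""
    return max(set(labels), key=labels.count)
```

With a flat patch of points adjacent to a perpendicular patch, the two patches receive different labels because the normal angle test fails at the edge between them, and `largest_cluster` returns the label of the bigger patch.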
After this procedure was executed for all the reference point clouds without the axle damper, the point clouds acquired with the axle damper were filtered by following the same steps as described above. Then, the registration of both point clouds was performed with different voxel grids (1 and 5 mm). The accuracy of the point cloud registration using the iterative closest point (ICP) algorithm [29] was measured by computing the RMSE, which was obtained by computing the Euclidean distance between corresponding points from the scan and reference point clouds. The RMSE was calculated for each registration [30], in which points with a corresponding reference point distance lower than a given threshold were marked as inliers.
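The inlier-based RMSE described above can be sketched as follows. This is an illustrative stand-alone function, not the code used in the evaluation pipeline: it assumes both point clouds are already aligned and expressed in the same units, and it uses a brute-force nearest-neighbor search. The function name and return convention are assumptions.

```python
import math


def inlier_rmse(scan, reference, max_inlier_distance):
    """RMSE over scan points whose closest reference point is within a threshold.

    scan, reference: lists of (x, y, z) tuples in the same units.
    Returns (rmse, inlier_fraction); rmse is None when there are no inliers.
    """
    squared_distances = []
    for p in scan:
        # Brute-force nearest neighbor; a k-d tree would be used for real clouds.
        d = min(math.dist(p, q) for q in reference)
        if d < max_inlier_distance:
            squared_distances.append(d * d)
    if not squared_distances:
        return None, 0.0
    rmse = math.sqrt(sum(squared_distances) / len(squared_distances))
    return rmse, len(squared_distances) / len(scan)
```

Reporting the inlier fraction together with the RMSE matters because a tighter threshold lowers the RMSE simply by discarding more points, so the two values need to be read together.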
The ICP algorithm aligns the sensor data with the reference point cloud by iteratively computing the 6-DoF transformation matrix that minimizes the RMSE of a given set of correspondences. For each iteration, every point in the sensor data is matched with the respective closest point in the reference point cloud. Points that have a correspondence distance higher than a given threshold are discarded from the list of correspondences to allow the algorithm to tolerate outliers. Then, the singular value decomposition (SVD) method is used to compute the 6-DoF transformation that minimizes the RMSE of the correspondence distances. The algorithm stops when the RMSE is below a given threshold or when the computed matrix has converged and stabilized, having a difference in relation to the previous iterations below specified translation and rotation thresholds. In addition, to bound its computation time, the algorithm can have an upper limit on its number of iterations and its maximum run time. At the end stage, the ICP algorithm returns the 6-DoF matrix that aligns and transforms the 3D sensor data into the reference point cloud, which corresponds to the sensor's 6-DoF pose in the reference point cloud coordinate system.
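The ICP loop described above can be sketched with NumPy as follows. This is a minimal point-to-point variant written for illustration, not the PCL or Dynamic Robot Localization implementation: it uses a brute-force nearest-neighbor search, stops when the RMSE stops changing or an iteration limit is reached, and its function names and default parameters are assumptions.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst (SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)  # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (determinant +1) instead of a reflection.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t


def icp(scan, reference, max_corr_dist, max_iterations=50, tolerance=1e-9):
    """Point-to-point ICP. Returns a 4x4 transform (scan -> reference) and the RMSE."""
    T = np.eye(4)
    current = scan.copy()
    previous_rmse = np.inf
    rmse = np.inf
    for _ in range(max_iterations):
        # Closest reference point for every scan point (brute force).
        dists = np.linalg.norm(current[:, None, :] - reference[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        nearest_dist = dists[np.arange(len(current)), nearest]
        keep = nearest_dist < max_corr_dist  # discard outlier correspondences
        if keep.sum() < 3:
            break  # not enough correspondences to estimate a rigid transform
        R, t = best_rigid_transform(current[keep], reference[nearest[keep]])
        current = current @ R.T + t
        step = np.eye(4)
        step[:3, :3], step[:3, 3] = R, t
        T = step @ T  # accumulate the incremental transform
        residuals = current[keep] - reference[nearest[keep]]
        rmse = float(np.sqrt(np.mean(np.sum(residuals ** 2, axis=1))))
        if abs(previous_rmse - rmse) < tolerance:
            break  # converged: the RMSE no longer changes
        previous_rmse = rmse
    return T, rmse
```

The returned matrix maps scan coordinates into the reference point cloud frame. Like any ICP variant, this sketch only converges to the correct alignment when the initial misalignment is small relative to the point spacing and the surface geometry.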
The evaluation relied on the Dynamic Robot Localization perception pipeline (https://github.com/carlosmccosta/dynamic_robot_localization, accessed on 22 April 2023) for performing the point cloud registrations and computing the RMSE. The perception pipeline [31][32][33] uses filtering, segmentation and alignment algorithms from PCL and was developed with the robot operating system (ROS) [34]. Figure 5 summarizes the steps associated with the generation of the reference point clouds and the registration process.

Sensors Evaluation
As described in the previous section, the first step consists of multiple scans, with and without the axle damper mounted in the van. Figures 6 and 7 present samples of the point clouds generated by each sensor, without and with the axle damper mounted on the van. The point clouds without the axle damper were segmented to extract only the ROI and then used as the reference point cloud for the point cloud registration. Figure 8 shows an example of the whole point cloud segmentation and registration process. The orange point cloud is the result of the point cloud segmentation using the region growing algorithm. In the point cloud alignment result, the color of the points is associated with the corresponding distances between the reference and the new point cloud, with green indicating that the Euclidean distance between a pair of matched points is close to zero and red indicating that the distance between the points is higher than the maximum inlier distance.
Tables 2 and 3 summarize the registration results obtained using the sensor and the CAD model as reference point clouds, respectively, presenting the average RMSE of inliers and the average percentage of inliers in the alignment result. Although each side of the van had a different reference point cloud, resulting from a slightly different surface geometry, the assessment did not evaluate the sides separately, as the objective was to identify the sensor with the best overall results for both sides. As such, for each sensor, 12 scans were captured from the left side of the van and another 12 scans from the right side of the van. From these 24 scans for each sensor, the methodology presented in the previous subsection (with its overview in Figure 5) was used to compute the mean values and standard deviations for each metric presented in Tables 2 and 3.
Analyzing Tables 2 and 3, it is possible to verify that the alignment results (RMSE and inliers percentage) were better when using reference point clouds based on a previous scan performed by the respective sensor instead of using CAD models. This was expected since the production of the van structure has some deviations and tolerances in relation to the CAD model. Moreover, registering a new point cloud with a previously captured and filtered scan can be used to evaluate the repeatability of both the sensor and the alignment algorithms. On the other hand, the RMSE difference between using a CAD-based and a scan-based reference point cloud is less significant with a bigger voxel grid (5 mm) since the voxel grid replaces all the points within a cell with their mean XYZ value. This can result in the absorption of the van structure production tolerances and the sensor measurement noise but can also raise the mean RMSE if the reference and scan voxel grids do not have overlapping coordinate systems, resulting in an offset between the cells that grows as the voxel size increases.
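As a rough illustration of the voxel grid behavior described above, the following sketch replaces all points inside each cubic cell with their mean XYZ value. It is not PCL's implementation; the function name and the `origin` parameter (used here to show the effect of a shifted grid) are assumptions.

```python
import math
from collections import defaultdict


def voxel_grid_filter(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Replace all points inside each cubic cell with their mean XYZ value.

    points: list of (x, y, z) tuples. origin: corner of the voxel lattice.
    Returns one centroid per occupied cell, in order of first occurrence.
    """
    cells = defaultdict(list)
    for p in points:
        # Integer cell index along each axis, relative to the lattice origin.
        key = tuple(math.floor((c - o) / voxel_size) for c, o in zip(p, origin))
        cells[key].append(p)
    return [tuple(sum(axis) / len(members) for axis in zip(*members))
            for members in cells.values()]
```

Calling the function on the same cloud with two different `origin` values groups the points into different cells and therefore yields different centroids, which is the grid offset effect mentioned above; the possible shift grows with `voxel_size`.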
Focusing solely on the results achieved when using the sensor-based reference point cloud and the voxel grid of 1 mm, the relative difference between the RMSE of the point clouds captured by each sensor is clearer. Namely, the lowest RMSEs were 0.25 mm, 0.49 mm and 1.01 mm when using the SICK TriSpector1030, Photoneo PhoXi S and Asus Xtion Pro Live, respectively. Additionally, the percentages of inliers were 99%, 92% and 88%, respectively. Moreover, no significant difference was found when varying the maximum inlier distance. The difference between sensors was lower when using a voxel grid of 5 mm. The lowest RMSEs were 1.30 mm, 1.40 mm and 1.49 mm, with a maximum inlier distance of 2 mm, when using the SICK TriSpector1030, Photoneo PhoXi S and Asus Xtion Pro Live, respectively. In this case, there were significant differences when varying the maximum inlier distance. The RMSE decreases with a smaller maximum inlier distance; however, the percentage of inliers decreases as well. Although the RMSE was smaller, the value refers to a smaller number of corresponding points.
As detailed in Table 1, the depth error listed in the SICK TriSpector1030 technical specifications is smaller than that of the other two sensors, and this specification is reflected in these results. This sensor provides the best alignment results when using both types of reference point clouds and when varying the voxel grid and the maximum inlier distance. Considering the voxel grid of 1 mm, the RMSE was always below 1 mm with a percentage of inliers above 85% even when using the CAD point cloud as the reference model. The RMSE increased above 1 mm when changing the voxel grid to 5 mm, but, overall, the SICK TriSpector1030 sensor performed better than the other sensors.
We were also able to achieve an RMSE below 1.00 mm (around 0.50 mm) with the Photoneo PhoXi S using a sensor-based reference point cloud with a voxel grid of 1 mm. Overall, the Photoneo PhoXi S performed worse than the SICK TriSpector1030 but better than the Asus Xtion Pro Live. In general, the Asus Xtion Pro Live generated the worst results, with an RMSE always above 1 mm. This was mainly related to the lower quality of the captured point cloud, which had less accuracy and higher sensor noise.
The lower RMSE achieved by the SICK TriSpector1030 was likely due to the usage of camera lens filters that block all light with the exception of the light frequencies associated with the laser line. This way, the SICK TriSpector1030 has better repeatability since its camera sensor has less pixel noise when compared with the other two sensors, which capture light from a much wider frequency range. In addition, being a line triangulation system, the SICK sensor can also employ subpixel algorithms to estimate the center of the detected laser line, further increasing its precision and repeatability.
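One common way to estimate a laser line center with subpixel resolution is an intensity-weighted centroid around the brightest pixel of each image column. The sketch below illustrates that generic idea only; the text does not state which subpixel method the SICK TriSpector1030 uses, and the function name and `window` parameter are hypothetical.

```python
def laser_line_center(column_intensities, window=2):
    """Subpixel row of a laser line in one image column (center of gravity).

    Finds the brightest pixel and returns the intensity-weighted mean row
    over `window` pixels on each side of it, or None if the column is dark.
    """
    rows = range(len(column_intensities))
    peak = max(rows, key=column_intensities.__getitem__)
    lo = max(0, peak - window)
    hi = min(len(column_intensities), peak + window + 1)
    total = sum(column_intensities[lo:hi])
    if total == 0:
        return None  # no laser light detected in this column
    return sum(r * column_intensities[r] for r in range(lo, hi)) / total
```

For a symmetric profile such as `[0, 0, 10, 30, 10, 0]` the estimate coincides with the peak row, while a profile whose energy is split evenly between two rows, such as `[0, 10, 30, 30, 10, 0]`, yields a center halfway between them.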
The time needed to process the registration and pose estimation was lower when using a bigger voxel grid (5 mm) since the point cloud was less dense (had fewer points) when compared with a point cloud generated with a smaller voxel grid (1 mm). Additionally, due to the usage of voxel grids, the density of the point clouds from each sensor was similar, which resulted in similar processing times. When considering a smaller voxel grid, there was a noticeable difference in the processing time since the density of the point clouds was higher, with a processing time proportional to the number of points registered. As described in Tables 2 and 3, the processing time was higher when using the Photoneo PhoXi S sensor because the raw point clouds from this sensor had a much higher number of points than those of the other sensors.

Machine Vision System for Fastening Operations
Taking into account the experimental results presented earlier and the use case requirements, such as the process cycle time limit for fastening the two axle dampers (which must be under 160 s) and the 3 mm accuracy, as well as the cost of each individual sensor, we implemented the automated axle damper fastening system with the SICK TriSpector1030 sensor. Namely, the fastening platform was equipped with the following:

1. One Doosan H2017 robotic arm with a range of 1700 mm and payload of 20 kg.
2. One SICK TriSpector1030 sensor.
3. Two screwdrivers attached to the end effector of the robotic arm.
The robot base was placed centered in relation to the left and right axle damper locations (as seen in Figure 9) for ensuring that the platform could perform the desired fastening operations on both axle dampers without needing to move its base. The approach implemented to determine the pose of the axle damper attachment structure was as follows:
1. Setup phase:
   (a) Robotic arm moves the 3D sensor to scan the ROI of the van without the axle damper (reference point cloud).
   (b) The pose of the attachment structure with respect to the robot is determined.
2. Operation phase performed for each side of the van:
   (a) Operator places the axle dampers on a new van.
   (b) Robotic arm moves the sensor to capture a new scan of the ROI of the van with the axle damper (seen in Figure 9c).
   (c) The sensor point cloud is aligned with the reference point cloud using the ICP algorithm.
   (d) The point cloud alignment is validated for ensuring that it has a minimum percentage overlap between the reference point cloud and the new scan.
   (e) The transformation matrix between the robot base and the axle damper attachment structure is computed.
   (f) Robotic arm moves the screwdrivers and performs the bolt-tightening operations (shown in Figure 9d).
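The validation and pose computation steps of the operation phase above can be sketched with homogeneous transforms as follows. This is a sketch under stated assumptions rather than the deployed implementation: it assumes that the reference cloud and the new scan are both expressed in the robot base frame and that the ICP result maps the new scan onto the reference cloud. The function names, the frame naming and the 0.8 overlap threshold are all hypothetical.

```python
import numpy as np


def make_transform(rotation_z_deg, translation):
    """4x4 homogeneous transform: rotation about Z followed by a translation."""
    a = np.radians(rotation_z_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), -np.sin(a), 0.0],
                 [np.sin(a), np.cos(a), 0.0],
                 [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def updated_structure_pose(T_base_structure_ref, T_icp, overlap, min_overlap=0.8):
    """Pose of the attachment structure on the current van, in the robot base frame.

    T_base_structure_ref: structure pose found in the setup phase.
    T_icp: transform that maps the new scan onto the reference cloud.
    overlap: fraction of scan points matched to the reference (0 to 1).
    """
    if overlap < min_overlap:
        # Hypothetical policy: refuse to move the screwdrivers on a poor alignment.
        raise ValueError("alignment rejected: insufficient overlap")
    # The van moved by the inverse of the scan-to-reference correction.
    return np.linalg.inv(T_icp) @ T_base_structure_ref
```

Under these assumptions, if the van is displaced by a transform `T_shift` relative to its setup position, ICP returns approximately the inverse of `T_shift`, and the function returns `T_shift @ T_base_structure_ref`, the pose the screwdrivers should target.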
The machine vision system was deployed at the van assembly workstation at the factory, where it was tested for two weeks (split into two shifts of seven hours and twenty-five minutes each). At the screwing workstation, the vehicles stopped for 160 s. Table 4 presents the performance results obtained from these trials.

Conclusions
The usage of 3D sensors to tackle perception challenges keeps expanding since they are able to provide accurate depth information and have fewer limitations regarding the environment light conditions when compared with 2D cameras. This paper presented a comparison of three depth sensors to evaluate which one is more suitable for providing 3D data for a machine vision system that estimates the 6-DoF pose of the attachment structure of axle dampers in the undercarriage of vans. The sensor analysis focused on comparing the suitability of the sensors using the RMSE and the overlap percentage computed after point cloud registration as metrics. For the described use case, the results indicate that the SICK TriSpector1030 is the most appropriate, given its lowest RMSE (0.25 mm ± 0.03 mm) and highest overlap percentage (99%), followed by the Photoneo PhoXi S, with an RMSE of 0.49 mm ± 0.14 mm. On the other hand, for applications that require less accuracy and in which low cost is an important factor, the Asus Xtion Pro Live can also be a feasible option, given that it achieved an acceptable RMSE (1.01 mm ± 0.11 mm). After deploying the machine vision system with the SICK TriSpector1030 on an assembly line, the automated bolt-tightening system was able to achieve 94% success in 53,400 screwing operations, with an operation time below 80 s.
Future work may include the comparison of other 3D sensors along with the inclusion of other perception algorithms that rely on fusion between 2D and 3D data for increasing the reliability of the segmentation stage for other use cases.
Funding: This work has received funding from the ERDF-European Regional Development Fund, through the North Portugal Regional Operational-NORTE 2020 programme under the Portugal 2020 Partnership Agreement within project EuroBot, with reference NORTE-01-0247-FEDER-072550.