Visual sensor fusion for active security in robotic industrial environments

This work presents a method of information fusion involving data captured by both a standard charge-coupled device (CCD) camera and a time-of-flight (ToF) camera to be used in the detection of the proximity between a manipulator robot and a human. Both cameras are assumed to be located above the work area of an industrial robot. The fusion of colour images and time-of-flight information makes it possible to know the 3D localisation of objects with respect to a world coordinate system and, at the same time, their colour information. Considering that the ToF information given by the range camera contains inaccuracies including distance error, border error, and pixel saturation, some corrections to the ToF information are proposed and developed to improve the results. The proposed fusion method uses the calibration parameters of both cameras to reproject 3D ToF points, expressed in a coordinate system common to both cameras and a robot arm, onto 2D colour images. In addition, motion detection in an industrial robot environment is achieved using the 3D information, and the fusion of information is applied to the foreground objects previously detected. This combination of information results in a matrix that links colour and 3D information, making it possible to characterise an object by its colour in addition to its 3D localisation. Further development of these methods will make it possible to identify objects and their position in the real world and


Introduction
Since the 1960s, industrial robots have been used in the manufacturing industry, where they have replaced humans in various repetitive, dangerous, or hostile tasks. A consequence of the incorporation of robots in industry is the emergence of new risks of accidents for workers. The standards that address these robot-related risks, among many other aspects, include the international standard ISO 10218, the American ANSI/RIA R15.06, the European EN 775, and national standards such as the Spanish UNE-EN 755. To prevent accidents, the selection of a security system must be based on the analysis of these risks. Traditionally, these security systems separate the robot workspace from the human one. One example of this requirement is found in the Spanish standard UNE-EN 755:1996 [1], which establishes that sensor systems have to be incorporated to prevent the entrance of humans into a hazardous area whenever the operating state of the robotic system implies danger to the human. According to traditional standards, maintenance, repair, or programming personnel can only be inside the robot workspace if the industrial robot is not in automatic mode.
However, in recent years, due in part to the flexible design of products, the optimization of production methods, and the introduction of new technologies, the tasks performed by industrial robots are no longer restricted to the transfer of objects, or other repetitive tasks. Instead, there is an increasing number of tasks in which humans and robots combine their skills in collaborative work.
To enable collaboration between human and robot, safety measures that establish a rigid separation between human and robot workspaces have to be removed. Instead, other types of security systems are required so that collisions can be avoided by detecting obstacles as well as their dynamic characteristics, and so that harm to the human can be mitigated in case of an unexpected impact. For this reason, research in this field is directed towards changing the way a human interacts with a robot, with the trend being that both human and robot share the same workspace at the same time. This change in the working relationship is reflected in the updates made since 2006 to the international standard ISO 10218 [2] and in guidelines for the implementation of these regulations, such as [3]. These guidelines introduce new concepts such as collaborative robots, collaborative operations, and collaborative workspaces.
Taking into account that security is a fundamental aspect in the design of robotic manufacturing systems, the development of systems and security strategies that allow safe collaborative work between human and robot is essential. The aim of this paper is to contribute to the initial stage of the design of a system for collision prevention between a human and a robot manipulator sharing a workspace at the same time. A method is proposed for processing the information acquired from two different types of vision sensors located above an industrial robot environment. The method, which is mainly focused on information captured from a time-of-flight camera, allows the fusion of colour and 3D information as an initial step towards the development of an active security system for application in an industrial robotics environment. This information fusion generates a colour and 3D information matrix which makes it possible to estimate simultaneously the colour characteristics of an object and its three-dimensional position in a world coordinate frame. At a later step, this combination of information will make it possible to associate a security volume with each characterised object, in order to prevent possible collisions between industrial robot and human.

Related work on shared human robot workspaces
To give context to the work presented in this paper, a brief summary of the different types of security applied to industrial robotic environments is provided. Figure 1 presents a possible classification of these types of security, together with the goals to achieve for each type, the systems and devices used, and the actions to apply on the robotic system.

Figure 1 Security systems in industrial robot environments.
This scheme aims to summarize a classification of the types of security applied to industrial robotic environments, as well as goals to achieve for each type of security, systems and devices used, and the action to apply on the robotic system.
Security systems in industrial robotic environments can be classified as passive or active. Passive security systems are hazard warning elements which do not alter the robot behaviour. These systems are audible or visible signals, such as alarms or lights, or systems that prevent inadvertent access to a restricted area. Active security systems in industrial robotic environments can be defined as the methods used to prevent the intrusion of humans into the robot workspace when the robot is in automatic mode. The difference from passive methods is that active methods can modify the robot behaviour. Historically, devices such as movement, proximity, force, acceleration, or light sensors have been used to detect human access to the robot workspace and to stop the execution of the robot task. However, as discussed previously, research in this field is moving towards allowing humans and robots to share workspaces.

Collision avoidance
A further way to enhance safety in shared human-robot workspaces is to implement collision avoidance systems. Robots have been provided with sensors capturing local information: ultrasonic sensors [4], capacitive sensors [5,6], and laser scanner systems [7] have all been tried to avoid collisions. However, the information provided by these sensors does not cover the whole scene, and so these systems can make only a limited contribution to safety in human-robot collaboration tasks [8]. Moreover, geometric representations of humans and robotic manipulators have been used to obtain a spatial representation in human-robot collaboration tasks. Numerical algorithms are then used to compute the minimum distance between human and robot and to search for collision-free paths [9-12]. Methods involving the combination of different types of devices have also been proposed to help avoid collisions. This idea has been applied to a cell production line for component exchange between human and robot in [13], where the safety module uses commands from light curtain sensors, joint angle sensors, and a control panel to prevent collision with the human when exchanging an object. The discussion below concentrates on artificial vision systems, range systems, and their combination.

Artificial vision systems
Artificial vision systems have also been used to prevent human-robot collisions. Visual information can be used on its own or in combination with information from other types of devices. In order to achieve safe human-robot collaboration, [14] describes a safety system made up of two modules. One module is based on a camera and computer vision techniques to obtain the human location. The other module, which is based on accelerometers and joint position information, is used to prevent unexpected robot motion due to a failure of robot hardware or software. Research work such as [15] investigates safety strategies for human-robot coexistence and cooperation, proposing the combined use of visual information from two cameras and information from a force/torque sensor. In order to perform collision tests, other work has used visual information acquired by cameras [16,17] to generate a 3D environment. Visual information is also used to separate humans and other unknown dynamic objects from the background [18] or to alter the behaviour of the robot [19]. In [20-22], visual information has been used to develop safety strategies based on fuzzy logic, probabilistic methods, or the calculation of a warning index, respectively.

Range systems
The depth map of a scene can be obtained by using depth sensors such as laser range finders and stereo camera systems. The results of using a laser time-of-flight (ToF) sensor are presented in [23] and [24] with the latter using several depth sensors in combination with presence sensors. Recently, a new type of camera has become available. These cameras, denominated as range-imaging cameras, 3D ToF cameras, or PMD cameras, capture information providing a 3D point cloud, among other information. They are starting to be used in active security systems for robotic industrial environments, among other applications. An example is a single framework for human-robot cooperation whose purpose is to achieve a scene reconstruction of a robotic environment by markerless kinematic estimation. For example, [8,25] use the information delivered by a 3D ToF camera mounted to the top of a robotic cell. This information is employed with the purpose of extracting robust features from the scene, which are the inputs to a module that estimates risks and controls the robot. In [26], the fusion of 3D information obtained from several range imaging cameras and the application of the visual hull technique are used to estimate the presence of obstacles within the area of interest. The configurations of a robot model and its future trajectory along with information on the detected obstacles are used to check for possible collisions.

Combination of vision and range systems
This technique is based on the combination of 3D information from range cameras and 2D information from standard charge-coupled device (CCD) cameras. Although this technique is being used in other applications, such as hand following [27,28] or mixed reality applications [29-31], not much work has been reported using it in the area of active security in robotic environments. In [32], an analysis of human safety in cooperation with a robot arm is performed. This analysis is based on information acquired by a 3D ToF camera and a 2D/3D Multicam. The 2D/3D Multicam is a monocular hybrid vision system which fuses range data from a PMD ToF sensor with 2D images from a conventional CMOS grey scale sensor. In the proposed method, the 3D ToF camera monitors the whole area, while any motion in the shared zones is analysed using the 2D/3D information from the Multicam. In [33], a general approach is introduced for surveillance of robotic environments using depth images from standard colour cameras or depth cameras. The fusion of data from CCD colour cameras or from ToF cameras is performed to obtain the object hull and its distance with respect to the known geometry of an industrial robot. The authors also present a comparison between distance information from colour and ToF cameras and a comparison between a single ToF camera and ToF information fusion. One of the conclusions of this work is that the fusion of information from several ToF cameras provides better resolution and less noise than the information obtained from a single camera. Finally, [34] describes a hybrid system based on a ToF camera and a stereo camera pair, which is proposed for application in human-robot collaboration tasks. Stereo information is used at unreliable ToF data points to generate a depth map which is fused with the depth map from the ToF camera. Colour features are not taken into account.
On the other hand, nearly a decade after ToF cameras entered the industrial market [35], a new type of 3D sensor (the RGB-D sensor), fitted with an RGB camera and a 3D depth sensor, was launched for non-commercial use [36]. The RGB-D sensor has several advantages over ToF cameras, such as higher resolution, lower price, and the availability of both depth and colour information. Hence, its study and application have been the subject of research work such as [37], which presents a review of Kinect-based computer vision algorithms and applications. Several topics are covered, including preprocessing tasks (with a review of Kinect recalibration techniques), object tracking and recognition, and human activity analysis. These authors propose in [38] an adaptive learning methodology to extract spatiotemporal features while simultaneously fusing the RGB and depth information. In addition, a review of several solutions for the fusion of RGB-D data is presented, and a website for downloading a dataset of RGB and depth information for hand gesture recognition is introduced. Regarding active security systems in industrial robotic environments, the Kinect sensor is starting to be incorporated, as shown in [39], where a real-time collision avoidance approach based on this sensor is presented.

Method for the fusion of colour and 3D information
The method presented here for the fusion of information acquired from a ToF camera and a colour camera takes a different standpoint from those proposed in the consulted papers. In papers that are not related to active security in robotic industrial environments, such as [27], the spatial transformation is performed by establishing the ToF camera coordinate system as the reference coordinate system. Therefore, if the position of an object in a world coordinate system is required, another calibration has to be done to establish the rotation matrix and translation vector that connect both coordinate systems. In the present paper, this aspect has been taken into account: a common coordinate system is defined for an industrial robot, a colour camera, and a ToF camera, so that the 3D location of an object in the robot arm workspace and its colour features are known at the same time. Among papers focusing on mixed reality applications, the setup used in [29] includes a CCD FireWire camera, a ToF camera, and a fisheye camera. After performing the calibration and establishing the relative transformations between the different cameras, a background model is generated which makes it possible to segment the actor from the scene; its use eliminates the need for chroma keying and also supports planning and alignment of virtual content. Paper [31] presents a survey of the basic measurement principles of ToF cameras including, among other issues, camera calibration, range image preprocessing, and sensor fusion. Several studies which investigate different combinations of high-resolution cameras and lower-resolution ToF cameras are mentioned.
Among the papers focused on active security, the most closely related to our work is [32]. Although a common world coordinate system for cameras and robot is also used there, the method seems to differ in that a spatial transform function is identified in order to map the image coordinates of the 2D sensor to the corresponding coordinates of the PMD sensor. Moreover, saturated pixel errors do not seem to have been considered. The work presented here takes a different standpoint: the parameters obtained from the camera calibration are used to transform the 3D point cloud given in the ToF camera coordinate system to the world coordinate system, and finally, the obtained internal and external parameters are used to reproject the corrected 3D points (corrected for distance error, saturated pixels, and jump edge effect) into colour images.
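The two geometric operations named above (moving ToF points into the world frame, then reprojecting them into the colour image) can be sketched as follows. This is only an illustration under a plain pinhole model without lens distortion; the function names, the intrinsic matrix `K`, and the rotation and translation values are hypothetical placeholders and are not the calibration results of this work.

```python
import numpy as np

def tof_to_world(points_tof, R_tw, t_tw):
    """Map (n, 3) points from the ToF camera frame to the world frame."""
    return points_tof @ R_tw.T + t_tw

def project_to_colour_image(points_world, K, R_wc, t_wc):
    """Project (n, 3) world points to (n, 2) pixel coordinates (pinhole model)."""
    cam = points_world @ R_wc.T + t_wc   # world frame -> colour camera frame
    uvw = cam @ K.T                      # apply the intrinsic matrix
    return uvw[:, :2] / uvw[:, 2:3]      # perspective division

# Hypothetical calibration values, chosen only to make the example readable.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R_identity = np.eye(3)
t_zero = np.zeros(3)

points_tof = np.array([[0.0, 0.0, 2.0],
                       [0.5, -0.25, 2.0]])
points_world = tof_to_world(points_tof, R_identity, t_zero)
pixels = project_to_colour_image(points_world, K, R_identity, t_zero)
print(pixels)  # pixel coordinates (u, v) of each 3D point
```

Each row of `pixels` gives the colour-image location at which the colour of the corresponding 3D point can be read, which is the link that the fused colour and 3D matrix relies on.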
So that any researcher can implement the proposed fusion method exactly as it has been carried out in the present work, this paper gives a detailed mathematical description of the steps involved.
In what follows, it is assumed that a 3D ToF camera and a colour camera are fixed and placed over the workspace of a robot arm and that the fields of view of both cameras overlap. It is also assumed that external temperature conditions are constant and that the integration time parameter of the 3D ToF camera is automatically updated at each data acquisition. Image and 3D data from the scene are captured and processed as described in the next sub-sections. Assume that the ToF camera has a resolution $n_x \times n_y$ and that the CCD camera has a resolution $\tilde{n}_x \times \tilde{n}_y$.
In what follows, vectors and matrices are denoted by Roman bold characters (e.g. $\mathbf{x}$). The jth element of a vector $\mathbf{x}$ is denoted as $x_j$, and element (i, k) of a matrix $\mathbf{A}$ is denoted as $A_{i,k}$. A super-index in parentheses, such as $(j)$, denotes a node within a range of distances, and a sub-index within square brackets, such as $[i]$, denotes an element of a set.

Reduction of ToF range camera errors
The reduction of range camera errors is a fundamental step to achieve an acceptable fusion of colour and 3D information. The existence of these errors causes the fused information to have issues that range from minor ones, such as border inaccuracy, to serious ones, such as the loss of information at saturated pixel coordinates.

Distance error reduction
It is well documented that ToF cameras suffer from a non-linear distance error, and several experiments have been carried out in order to model and correct this distance error (or circular error) [35,40-43]. With the purpose of decreasing the influence of this error on distance measurements, a procedure is described below to correct the ToF distance values based on a study of the behaviour of the camera. This study requires a ToF camera positioned parallel to the floor and a flat panel of light colour and low reflectance mounted on a robot arm. The panel is also parallel to the floor. The robot arm is used to displace the panel along a distance range so that ToF data can be captured at different distances.
The distance error analysis from the acquired data can be performed in two ways: a global analysis of all the pixels without taking pixel position into account and an analysis which takes into account the position of each pixel. The first analysis is easier to perform as it only requires a relatively small panel; it is assumed that there is no error due to pixel localization and only a reduced region of the 3D ToF data is analysed. The second analysis can be carried out to check the suitability of the assumption of negligible error due to pixel localization of the first analysis. The second analysis requires a larger panel, as the distance image captured by the camera has to be based only on the panel for different distances. Both methods are described in the steps below.
1. Image capture. Since distance measurements are influenced by the camera internal temperature, a minimum warm-up period is necessary to obtain stable measurements [43]. After the camera warms up, ToF information is captured at each of the $P$ nodes into which the distance range $D$ is divided. Each capture is defined by an amplitude matrix $\mathbf{A}$ of dimensions $n_x \times n_y$ and by 3D information made up of three coordinate matrices $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$, each of dimensions $n_x \times n_y$. In order to generate a model of the distance error, a training set $Z_T^{(j)}$ of distance information along the z axis is formed by capturing $N$ images at each node $j$, with $j = 1, \ldots, P$. Similarly, training sets of distance information are defined for the x and y axes, denoted as $X_T^{(j)}$ and $Y_T^{(j)}$, respectively. In order to validate the model so obtained, a validation set $Z_V^{(j)}$ of distance information is also formed by capturing $M$ additional images at each node $j$, with $j = 1, \ldots, P$. Similarly, validation sets of distance information are defined for the x and y axes, denoted as $X_V^{(j)}$ and $Y_V^{(j)}$, respectively.
In this article, the sets of information $Z_T^{(j)}$ and $Z_V^{(j)}$ are also called ToF distance images and are defined as

$$Z_T^{(j)} = \left\{ \mathbf{Z}_{[1]}^{(j)}, \ldots, \mathbf{Z}_{[N]}^{(j)} \right\}, \qquad Z_V^{(j)} = \left\{ \mathbf{Z}_{[1]}^{(j)}, \ldots, \mathbf{Z}_{[M]}^{(j)} \right\} \tag{1}$$

2. Angle correction. Correction angles are applied to the ToF information sets for each axis x, y, and z, with the aim of compensating for any 2D angular deviation between the (x, y) plane of the range camera and the plane defined by the floor. This 2D angular deviation is denoted by the angles $\theta_x$ and $\theta_y$. This correction makes it possible to obtain parameter values as if camera and panel were perfectly parallel.
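Given an x axis distance image $\mathbf{X}_T$ of dimensions $n_x \times n_y$, define its sub-matrix $\hat{\mathbf{x}}$ of dimensions $n_1 \times n_2$, where $n_1 < \mathrm{int}(n_x/2)$ and $n_2 < \mathrm{int}(n_y/2)$, as a matrix formed such that its top left element $\hat{x}_{1,1}$ corresponds to element $X_{T\,i_c,j_c}$. Index $i_c$ is chosen as $\mathrm{int}(n_x/2)$, and index $j_c$ is chosen as $\mathrm{int}(n_y/2)$. Similarly, sub-matrices $\hat{\mathbf{y}}$ and $\hat{\mathbf{z}}$ are defined for axes y and z, respectively. Define $\bar{\mathbf{x}}$, $\bar{\mathbf{y}}$, and $\bar{\mathbf{z}}$ as the column-wise vectorised forms of sub-matrices $\hat{\mathbf{x}}$, $\hat{\mathbf{y}}$, $\hat{\mathbf{z}}$, each with dimension $n \times 1$, where $n = n_1 n_2$ is the number of pixels in the selected area. This central region is taken from each ToF distance image to estimate and correct the 2D angular inclination between the panel and the ToF camera. Hence, for each image region, 3D points are modified using the rotation matrices $\mathbf{R}_x$ and $\mathbf{R}_y$. A first rotation is applied around the x axis:

$$\mathbf{G} = \mathbf{R}_x \begin{bmatrix} \bar{\mathbf{x}} & \bar{\mathbf{y}} & \bar{\mathbf{z}} \end{bmatrix}^T \tag{2}$$

where $\mathbf{G}$ has dimensions $3 \times n$. The transformed image region for the z coordinate is obtained from the third row of $\mathbf{G}$:

$$\bar{\mathbf{z}}' = \begin{bmatrix} G_{3,1} & G_{3,2} & \cdots & G_{3,n} \end{bmatrix}^T \tag{3}$$

and in this way, a vector $\bar{\mathbf{z}}'$ of dimensions $n \times 1$ is defined.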
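A second rotation transformation is applied around the y axis such that

$$\mathbf{H} = \mathbf{R}_y \begin{bmatrix} \bar{\mathbf{x}} & \bar{\mathbf{y}} & \bar{\mathbf{z}}' \end{bmatrix}^T \tag{4}$$

The transformed image region for the z coordinate is obtained from the third row of $\mathbf{H}$:

$$\bar{\mathbf{z}}'' = \begin{bmatrix} H_{3,1} & H_{3,2} & \cdots & H_{3,n} \end{bmatrix}^T \tag{5}$$

where $\bar{\mathbf{z}}''$ is of dimension $n \times 1$. Since the above rotation causes a displacement of the 3D points along the y axis, the $\bar{\mathbf{y}}$ vector is used to represent ToF information after angle correction. In this way, the 3D ToF vectors after angle correction are $\bar{\mathbf{x}}$, $\bar{\mathbf{y}}$, $\bar{\mathbf{z}}''$, each one of dimensions $n \times 1$.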
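The angle correction step can be sketched in a few lines. This is a minimal illustration, assuming standard right-handed rotation matrices for $\mathbf{R}_x$ and $\mathbf{R}_y$ (the sign convention is an assumption), and assuming that only the z component is updated after each rotation while the x and y vectors are kept unchanged. The function names are hypothetical.

```python
import numpy as np

def rot_x(theta):
    """Rotation matrix about the x axis (sign convention assumed)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(theta):
    """Rotation matrix about the y axis (sign convention assumed)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def angle_correct(x_bar, y_bar, z_bar, theta_x, theta_y):
    """Return the vectorised region after the two rotations; only z changes."""
    G = rot_x(theta_x) @ np.vstack([x_bar, y_bar, z_bar])   # 3 x n
    z1 = G[2]                                               # z after first rotation
    H = rot_y(theta_y) @ np.vstack([x_bar, y_bar, z1])      # 3 x n
    return x_bar, y_bar, H[2]

# Synthetic panel tilted by 2 degrees about the x axis, 2 m from the camera.
theta = np.deg2rad(2.0)
y_bar = np.linspace(-0.1, 0.1, 25)
x_bar = np.zeros_like(y_bar)
z_bar = 2.0 - np.tan(theta) * y_bar

_, _, z_corr = angle_correct(x_bar, y_bar, z_bar, theta_x=theta, theta_y=0.0)
print(np.ptp(z_corr))  # spread of z across the region after correction
```

For this synthetic tilt, the corrected z values become constant across the region, which is the behaviour expected of a panel that is parallel to the camera plane.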
3. If the pixel position is not considered, then:
(a) Discrepancy curve calculation stage. In order to test the effect of the angle correction on the distance error, the same procedure is applied using data before and after angle correction. However, the method is described using data after angle correction. The selected area is used to calculate several parameters including the mean distance value, the discrepancy distance value, and the mean squared error (MSE). Define a set of distances after angle correction $\bar{Z}''^{(j)} = \left\{ \bar{\mathbf{z}}''^{(j)}_{[1]}, \bar{\mathbf{z}}''^{(j)}_{[2]}, \ldots, \bar{\mathbf{z}}''^{(j)}_{[N]} \right\}$ at each node $j$, with $j = 1, \ldots, P$. The mean ToF distance over the selected area in all ToF distance images, $\bar{Z}_j$, at each node $j$, is calculated by means of:

$$\bar{Z}_j = \frac{1}{N n} \sum_{i=1}^{N} \sum_{k=1}^{n} \bar{z}''^{(j)}_{[i],k} \tag{6}$$

where the resulting $\bar{\mathbf{Z}}$ is a vector with dimensions $P \times 1$.
Define $L_j$ as a distance value obtained by a laser distance meter at each node $j$ (henceforth this value is treated as ground truth), and a vector $\mathbf{L} = [L_1, \ldots, L_P]^T$ with dimensions $P \times 1$. Then, the discrepancy distance vector, $\boldsymbol{\delta}_d$, is calculated as the difference between the mean distance from the ToF camera after angle correction, $\bar{\mathbf{Z}}$, and the ground truth vector $\mathbf{L}$:

$$\boldsymbol{\delta}_d = \bar{\mathbf{Z}} - \mathbf{L} \tag{7}$$

In order to obtain correction values to be applied to new ToF distance images, a cubic spline is used to fit this discrepancy information for each distance. The cubic spline is modelled as a function $s$ that passes through all the points $(\bar{\mathbf{Z}}, \boldsymbol{\delta}_d)$ and that, on each interval $[\bar{Z}_j, \bar{Z}_{j+1}]$, is expressed as a polynomial

$$s(z) = a_0 + a_1 z + a_2 z^2 + a_3 z^3, \qquad z \in [\bar{Z}_j, \bar{Z}_{j+1}] \tag{8}$$

where $j = 1, \ldots, P - 1$. For each sub-interval, the coefficients $a_0, a_1, a_2, a_3$ are calculated so that the curve passes through the points $(\bar{Z}_j, \delta_{d_j})$ and $(\bar{Z}_{j+1}, \delta_{d_{j+1}})$ [44]. The resulting spline, henceforth called the discrepancy curve, makes it possible to estimate the discrepancy correction value for a given ToF distance.
(b) Discrepancy correction. In order to reduce the errors in the distance estimates obtained from the ToF information, the set of ToF distance images for validation $Z_V$ is used to validate the discrepancy curve. To this end, a vector of validation ToF distance values after angle correction, $\bar{\mathbf{z}}''_v$ (dimension $n \times 1$), is defined and evaluated on the discrepancy curve to obtain the vector of correction values $\mathbf{C}$ (dimension $n \times 1$). Then, the corrected distance vector for a distance image after its angle correction is calculated as follows:

$$\breve{\mathbf{z}} = \bar{\mathbf{z}}''_v - \mathbf{C} \tag{9}$$

Define $\breve{Z}^{(j)} = \left\{ \breve{\mathbf{z}}^{(j)}_{[1]}, \breve{\mathbf{z}}^{(j)}_{[2]}, \ldots, \breve{\mathbf{z}}^{(j)}_{[M]} \right\}$ as a set of distances after discrepancy correction for each node $j$, with $j = 1, \ldots, P$. The mean value after discrepancy correction for the $M$ ToF distance images obtained at each node $j$ is calculated as follows:

$$\breve{Z}_j = \frac{1}{M n} \sum_{i=1}^{M} \sum_{k=1}^{n} \breve{z}^{(j)}_{[i],k} \tag{10}$$

with $j = 1, \ldots, P$ and where the resulting $\breve{\mathbf{Z}}$ is a vector with dimensions $P \times 1$.
In order to observe the effect that these corrections have on the 3D ToF points, the MSE can be calculated before and after the discrepancy correction. Define for each node $j$ a vector with the corresponding laser distance meter values $\mathbf{L}'^{(j)} = [L^{(j)}, \ldots, L^{(j)}]^T$ with dimension $n \times 1$ (treated here as ground truth). The mean squared error at each pixel $k$ and for node $j$ can then be calculated as

$$\mathrm{MSE}_k^{(j)} = \frac{1}{N'} \sum_{i=1}^{N'} \left\| Z^{(j)}_{[i],k} - L'^{(j)}_k \right\|^2 \tag{11}$$

where $\|\cdot\|$ is the Euclidean norm, $N'$ is the number of ToF distance images used, and $\mathbf{Z}$ is a vector of ToF distance values that can be substituted by the angle corrected vector $\bar{\mathbf{z}}''$ of each distance image or by the discrepancy corrected vector $\breve{\mathbf{z}}$ of each distance image, each one with dimension $n \times 1$, and with $j = 1, \ldots, P$. The set of $\mathrm{MSE}_k^{(j)}$ values for $k = 1, \ldots, n$ gives an indication of the planar distribution of the distance error for a given node $j$. Then, for a given node $j$, it is possible to average the mean squared errors to obtain an indication of the error depending on the node position:

$$\mathrm{MSE}^{(j)} = \frac{1}{n} \sum_{k=1}^{n} \mathrm{MSE}_k^{(j)} \tag{12}$$

4. If the position of each pixel is taken into account, then:
(a) Discrepancy curves calculation stage. Using the $N$ angle corrected ToF distance images represented by $\bar{\mathbf{z}}''$, a discrepancy curve is calculated for each pixel at each distance node. At this stage, using $N$ images at each node $j$, the mean value of each pixel $k$, where $k = 1, \ldots, n$, is calculated as follows:

$$\bar{V}_{k,j} = \frac{1}{N} \sum_{i=1}^{N} \bar{z}''^{(j)}_{[i],k} \tag{13}$$

where the resulting $\bar{\mathbf{V}}$, whose elements are the values $\bar{V}_{k,j}$, is a matrix with dimensions $n \times P$. Define a new matrix $\mathbf{L}''$ of dimension $n \times P$ which is obtained by replicating $n$ times the laser distances vector $\mathbf{L}^T$ as follows:

$$\mathbf{L}'' = \begin{bmatrix} L^{(1)}_{[1]} & \cdots & L^{(P)}_{[1]} \\ \vdots & \ddots & \vdots \\ L^{(1)}_{[n]} & \cdots & L^{(P)}_{[n]} \end{bmatrix} \tag{14}$$

Then, the discrepancy distance matrix $\boldsymbol{\delta}_v$ for all the nodes $j$ is calculated for each pixel $k = 1, \ldots, n$ as the difference between the mean distance from the ToF camera after angle correction, $\bar{\mathbf{V}}$, and the ground truth distance matrix $\mathbf{L}''$ obtained using a laser distance meter:

$$\boldsymbol{\delta}_v = \bar{\mathbf{V}} - \mathbf{L}'' \tag{15}$$

with $\boldsymbol{\delta}_v$ of dimension $n \times P$.
In order to obtain $n$ correction values to be applied to any new ToF distance image, a cubic spline is calculated to fit this discrepancy information along the distance range for each pixel. The cubic spline is modelled at each pixel $k$ using Equation 8 and the data points $(\bar{\mathbf{V}}, \boldsymbol{\delta}_v)$.
(b) Correction using a discrepancy curve at each pixel. In order to reduce the errors in the ToF distance images, the set of ToF distance images $Z_V$ is used to validate each discrepancy curve at each pixel. To this end, each pixel $k$ of the validation vector after angle correction, $\bar{\mathbf{z}}''_v$ (dimension $n \times 1$), is evaluated on its discrepancy curve to obtain the vector of correction values $\mathbf{C}_v$ (dimension $n \times 1$). Then, the corrected distance vector $\breve{\mathbf{v}}$ (dimension $n \times 1$) is obtained using the expression

$$\breve{\mathbf{v}} = \bar{\mathbf{z}}''_v - \mathbf{C}_v \tag{16}$$

Define $\breve{V}^{(j)} = \left\{ \breve{\mathbf{v}}^{(j)}_{[1]}, \breve{\mathbf{v}}^{(j)}_{[2]}, \ldots, \breve{\mathbf{v}}^{(j)}_{[M]} \right\}$ as the set of distances after discrepancy correction, where the mean value at each pixel $k$ for each node $j$ is calculated as follows:

$$\breve{V}_{k,j} = \frac{1}{M} \sum_{i=1}^{M} \breve{v}^{(j)}_{[i],k} \tag{17}$$

with $k = 1, \ldots, n$, $j = 1, \ldots, P$, and where the resulting $\breve{\mathbf{V}}$ is a matrix with elements $\breve{V}_{k,j}$ and dimensions $n \times P$ of mean ToF distance values at each pixel for each node. The mean squared error is obtained by means of Equation 11, where $\mathbf{Z}$ is replaced by the corrected values $\breve{\mathbf{v}}$.
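The per-pixel variant can be sketched as one discrepancy curve per pixel. This is an illustrative sketch only: it uses SciPy's `CubicSpline` with its default boundary conditions, which may differ from the spline construction of [44], and the data are synthetic (each pixel is given a constant, hypothetical offset). The function names are hypothetical.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def per_pixel_curves(v_mean, laser):
    """v_mean: (n, P) mean distance per pixel and node; laser: (P,) ground truth."""
    delta_v = v_mean - laser[None, :]            # discrepancy per pixel and node
    return [CubicSpline(v_mean[k], delta_v[k])   # one discrepancy curve per pixel
            for k in range(v_mean.shape[0])]

def correct_per_pixel(z_v, curves):
    """Subtract from each pixel the correction given by its own curve."""
    return np.array([float(z - curve(z)) for z, curve in zip(z_v, curves)])

# Synthetic data: P = 5 nodes, n = 3 pixels with different constant offsets.
laser = np.linspace(1.0, 3.0, 5)
offsets = np.array([0.01, -0.02, 0.03])
v_mean = laser[None, :] + offsets[:, None]

curves = per_pixel_curves(v_mean, laser)
z_v = v_mean[:, 2]                   # a validation image taken at the third node
v_corrected = correct_per_pixel(z_v, curves)
print(v_corrected)
```

Note that `CubicSpline` requires the sample distances of each pixel to be strictly increasing along the nodes, which holds when the nodes are well separated compared with the distance error.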
A comparison of the MSE values for discrepancy corrected and non-corrected measurements gives a measure of improvement in accuracy due to the discrepancy correction. If no such improvement is detected, then it is recommended to revise the experimental conditions as this may indicate the existence of problems with the experiment.
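For the global analysis (pixel position not considered), the whole chain of discrepancy curve, correction, and MSE comparison can be sketched as below. The data are synthetic, with a made-up sinusoidal distance error and Gaussian noise, and SciPy's `CubicSpline` stands in for the spline of [44]; the function names and array layouts are assumptions made for this example.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def discrepancy_curve(z_train, laser):
    """z_train: (P, N, n) angle-corrected distances; laser: (P,) ground truth."""
    z_mean = z_train.mean(axis=(1, 2))        # mean ToF distance per node
    return CubicSpline(z_mean, z_mean - laser)

def correct(z, curve):
    """Subtract the correction value read from the discrepancy curve."""
    return z - curve(z)

def node_mse(z_node, laser_j):
    """z_node: (N', n). Returns the per-pixel MSE and its average over pixels."""
    per_pixel = np.mean((z_node - laser_j) ** 2, axis=0)
    return per_pixel, per_pixel.mean()

rng = np.random.default_rng(0)
P, N, M, n = 9, 5, 3, 100
laser = np.linspace(1.0, 3.0, P)                  # node distances in metres
bias = 0.02 * np.sin(2 * np.pi * laser)           # synthetic distance error
z_train = (laser + bias)[:, None, None] + rng.normal(0, 1e-3, (P, N, n))
z_valid = (laser + bias)[:, None, None] + rng.normal(0, 1e-3, (P, M, n))

curve = discrepancy_curve(z_train, laser)
z_corr = correct(z_valid, curve)

mse_before = np.array([node_mse(z_valid[j], laser[j])[1] for j in range(P)])
mse_after = np.array([node_mse(z_corr[j], laser[j])[1] for j in range(P)])
print(mse_before.mean(), mse_after.mean())
```

With real data, distances outside the range covered by the nodes would be extrapolated by the spline, so corrections there should be treated with caution.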

Correcting the values of saturated pixels
Information from range cameras can be affected by pixel saturation, which is caused by excessive reflection of light from objects. Although its effect can be reduced by an automatic update of the integration time parameter of the ToF camera [31], in some circumstances, such as the presence of metal or reflective paints, this tool is not enough.
The saturation of range camera information affects the amplitude and distance values returned by the range camera. These values are very different from the remaining pixel values of the scene. The proposed strategy to detect saturated pixels is based on this fact, and an analysis of amplitude signal is made. The method has two stages.
1. Looking for saturated pixels. According to [45], pixel saturation occurs when the amplitude values are greater than a given threshold value $\zeta$, which depends on the camera being employed. Hence, the amplitude image is searched for values greater than or equal to this value in order to generate a saturation binary mask $\mathbf{M}$ with ones at the positions of the saturated pixels and zeros elsewhere.
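To be able to perform the correction on pixels located at the edges of the image, the amplitude and 3D information matrices are augmented by replicating rows and columns located at the edges of the matrix. Define $p$ as the number of rows and columns of $\mathbf{A}$ to be replicated. Define the $p$ upper rows of $\mathbf{A}$ as $B_{i,j} = A_{i,j}$, such that $\mathbf{B}$ is of dimension $p \times n_y$, where $i = 1, \ldots, p$ and $j = 1, \ldots, n_y$, and the $p$ lower rows as $B'_{i,j} = A_{n_x - i + 1,j}$, such that $\mathbf{B}'$ is of dimension $p \times n_y$, where $i = 1, \ldots, p$ and $j = 1, \ldots, n_y$. Define the intermediate matrix $\hat{\mathbf{A}}$ as follows:

$$\hat{\mathbf{A}} = \begin{bmatrix} \mathbf{B} \\ \mathbf{A} \\ \mathbf{B}' \end{bmatrix}$$

where $\hat{\mathbf{A}}$ is a matrix of dimension $(2p + n_x) \times n_y$. Then, define the left $p$ columns of $\hat{\mathbf{A}}$ as $B''_{i,j} = \hat{A}_{i,j}$, such that $\mathbf{B}''$ is of dimension $(2p + n_x) \times p$, where $i = 1, \ldots, 2p + n_x$ and $j = 1, \ldots, p$, and the $p$ right columns as $B'''_{i,j} = \hat{A}_{i,n_y - j + 1}$, such that $\mathbf{B}'''$ is of dimension $(2p + n_x) \times p$, where $i = 1, \ldots, 2p + n_x$ and $j = 1, \ldots, p$. Then, the augmented amplitude matrix $\tilde{\mathbf{A}}$ of dimensions $(2p + n_x) \times (2p + n_y)$ is given by:

$$\tilde{\mathbf{A}} = \begin{bmatrix} \mathbf{B}'' & \hat{\mathbf{A}} & \mathbf{B}''' \end{bmatrix}$$

To represent saturated pixels in $\tilde{\mathbf{A}}$, the binary mask matrix $\tilde{\mathbf{M}}$ of dimensions $(2p + n_x) \times (2p + n_y)$ is defined by

$$\tilde{M}_{i,j} = \begin{cases} 1, & \tilde{A}_{i,j} \geq \zeta \\ 0, & \text{otherwise} \end{cases}$$

where $i = 1, \ldots, n_x + 2p$ and $j = 1, \ldots, n_y + 2p$.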

2. Correction of saturated pixels. In order to replace an incorrect value with the average of its neighbours, the saturation binary mask is used to find the coordinates of saturated values in the amplitude and 3D matrices and to calculate the mean value of the surrounding pixels. Saturated values are not taken into account in this calculation. Define a window mask $\mathring{M}_{i,j} = \tilde{M}_{r-p+i-1,\,c-p+j-1}$, with $i = 1, \ldots, 2p + 1$ and $j = 1, \ldots, 2p + 1$, of dimensions $(2p + 1) \times (2p + 1)$, centred on each saturated pixel with position $(r, c) \in Q$, where $Q$ is the index set of saturated pixels. In order to calculate a new pixel value to replace a saturated pixel value, define a window of amplitude values $\mathring{A}_{i,j} = \tilde{A}_{r-p+i-1,\,c-p+j-1}$, with $i = 1, \ldots, 2p + 1$ and $j = 1, \ldots, 2p + 1$, of dimensions $(2p + 1) \times (2p + 1)$, in which the saturated pixel occupies the central position $(p + 1, p + 1)$. The new value $\tilde{A}_{r,c}$ for each saturated pixel $(r, c) \in Q$ is calculated as

$$\tilde{A}_{r,c} = \frac{\sum_{i=1}^{2p+1} \sum_{j=1}^{2p+1} \mathring{A}_{i,j} \left(1 - \mathring{M}_{i,j}\right)}{\sum_{i=1}^{2p+1} \sum_{j=1}^{2p+1} \left(1 - \mathring{M}_{i,j}\right)}$$

Figure 2 shows an example of the movement of the search window, obtained from the binary saturation mask, that is used to select and replace values in the amplitude and 3D information matrices.
Define $(X, Y)$ as the initial ToF data and $z$ as the ToF distance data after discrepancy correction. As these values are also affected by the amplitude saturation, a similar procedure, driven by the index set $Q$ of saturated amplitude values, is applied to correct the corresponding entries of these matrices, obtaining the matrices $(\tilde{X}, \tilde{Y}, \tilde{Z})$. Once saturated pixels are corrected, all matrices are resized to their initial dimensions by removing the rows and columns previously added, which results in matrices $X'$, $Y'$, $Z'$, and $A'$.
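The saturated-pixel correction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `correct_saturated`, the default window half-size `p = 1`, and the use of `np.pad(mode="edge")` for the border replication are assumptions (for `p = 1` this padding coincides with the row and column replication defined above; for larger `p` it is a simplification).

```python
import numpy as np

def correct_saturated(values, amplitude, threshold=20000, p=1):
    """Replace entries of `values` at saturated pixels by the mean of the
    non-saturated neighbours inside a (2p+1) x (2p+1) window.

    `values` can be the amplitude matrix itself or one of X, Y, Z.
    The padding mode and parameter names are illustrative assumptions.
    """
    values = np.asarray(values, dtype=float)
    saturated = np.asarray(amplitude) > threshold      # index set Q
    # Augment the matrices so that windows on border pixels are complete.
    v_pad = np.pad(values, p, mode="edge")
    ok_pad = np.pad(~saturated, p, mode="edge")        # True = usable value
    out = values.copy()
    for r, c in zip(*np.nonzero(saturated)):
        # Pixel (r, c) sits at (r + p, c + p) in the padded matrices.
        win = v_pad[r:r + 2 * p + 1, c:c + 2 * p + 1]
        ok = ok_pad[r:r + 2 * p + 1, c:c + 2 * p + 1]
        if ok.any():                                   # skip fully saturated windows
            out[r, c] = win[ok].mean()
    return out                                         # already at the original size
```

Because the output is written into a copy with the original shape, the final resizing step (removing the added rows and columns) is implicit in this sketch.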

Figure 2 Mask for saturated pixel reduction.
Example of using mask for saturation pixel reduction.

Jump edge reduction
Another error that may affect the 3D data from a range camera is known as jump edge. This error produces spurious pixels, which are inaccurate 3D measurements of the real scene. In order to reduce this effect, the use of a median filter followed by a jump edge filter based on a local neighbourhood is proposed in [46]. Other solutions, which implement a non-local means filter or edge-directed re-sampling techniques, are enumerated in [31]. In the present work, the use of 2D techniques applied to 3D points is proposed to prevent border inaccuracy in the fused information. Traditionally, the morphological gradient is used in grey scale images to emphasise transitions of grey levels [47,48]. In this work, only distance values from the 3D data are used, generating a distance image. With the objective of finding pixels suffering from this effect, the morphological gradient is calculated using the following expression [48]:

\[ g = (f \oplus S) - (f \ominus S), \]

where $g$ is of dimension $n_x \times n_y$, $f$ is a ToF distance matrix of the same dimension as $g$, $S$ is a 3 × 3 generalised dilation or erosion mask, and $\oplus$ and $\ominus$ are the dilation and erosion operations, respectively.
A threshold value to discriminate non-desirable pixels from the remaining ones is then searched for. With this aim, the gradient image g is transformed into a new image G with values ranging from 0 to 255 (Equation 24). After that, the histogram of G is calculated and then smoothed by means of a Butterworth filter. Finally, a threshold value η is defined by searching along the smoothed histogram for the first minimum to the right of the first maximum. A new distance matrix f′ is generated by forcing to zero the spurious pixels found with this threshold and keeping the original distance values for the remaining pixels. When performing the fusion of ToF and colour information, jump edge reduction is carried out after scaling up the ToF information, as discussed below.
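As an illustration of the jump edge reduction steps, the following NumPy sketch computes a morphological gradient, rescales it to the range 0 to 255, and zeroes the pixels above a given threshold. Several points are assumptions and not taken from the source: the structuring element is a flat 3 × 3 mask (so dilation and erosion reduce to a local maximum and minimum), the rescaling is linear, spurious pixels are those whose rescaled gradient exceeds `eta`, and all function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def morphological_gradient(f):
    """Dilation minus erosion of a distance image with a flat 3x3 mask."""
    f = np.asarray(f, dtype=float)
    # One 3x3 window per pixel; edge padding keeps the output size n_x x n_y.
    windows = sliding_window_view(np.pad(f, 1, mode="edge"), (3, 3))
    return windows.max(axis=(2, 3)) - windows.min(axis=(2, 3))

def rescale_to_255(g):
    """Linear rescaling to [0, 255] (assumed form of the transformation)."""
    g = np.asarray(g, dtype=float)
    span = g.max() - g.min()
    if span == 0:
        return np.zeros_like(g)
    return 255.0 * (g - g.min()) / span

def remove_jump_edges(f, eta):
    """Set to zero the distance values whose rescaled gradient exceeds eta."""
    f = np.asarray(f, dtype=float)
    G = rescale_to_255(morphological_gradient(f))
    return np.where(G > eta, 0.0, f)
```

On a step between two flat surfaces, the gradient is non-zero on both sides of the discontinuity, so the detected edge is two pixels wide, which is what lets the filter catch spurious pixels lying between foreground and background.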

Colour and 3D information fusion
Information fusion from a standard CCD camera and a ToF camera allows the simultaneous use of 3D and colour information. This can be achieved by means of the reprojection of 3D ToF points into a colour image. In an active security system, moving objects, such as robots and humans, have to be detected to prevent possible collisions between them. To obtain information about these objects and develop the algorithms that make it possible to avoid collisions, foreground detection is carried out in such a way that the fused information is obtained only for those pixels previously classified as foreground. Foreground object detection in a scene is carried out using 2D techniques over 3D ToF points, and subsequently, colour and 3D information from the foreground objects is fused.

3D information analysis for detecting foreground objects
Background subtraction methods for detecting moving objects have been proposed, analysed, and employed to locate object motion in a 2D image sequence [49][50][51]. In this work, since the ToF camera is static and illumination changes do not affect the acquired 3D points, the background subtraction technique has been considered suitable to be adapted and applied to three-dimensional information for the purpose of motion detection in a 3D point cloud. Therefore, after performing distance and saturated pixel correction, a background subtraction method based on the reference image model is adapted to be used in a 3D point cloud. The goal is to discriminate the static part of the 3D scene from the moving objects, so an offline background reference image $B_T$ is calculated as the average image during a time period $T = 1, \ldots, t$. Define a set of $t$ ToF distance images after discrepancy and pixel saturation correction captured in a time period $T$, such that $Z' = \{Z'_1, Z'_2, \ldots, Z'_t\}$; then, the background reference image is calculated as

\[ B_T(i) = \frac{1}{t} \sum_{k=1}^{t} Z'_k(i), \qquad i = 1, \ldots, n, \]

where $n$ is the number of pixels in each ToF distance image. With the aim of detecting pixels that show motion, the difference image $Z'_d$ between the reference and a current image $Z'_c$ is calculated as

\[ Z'_d = \left| B_T - Z'_c \right|, \]

where $|\cdot|$ indicates an element-wise absolute value operation.
Foreground detection is performed in those pixels whose distance value, $Z'_d$, exceeds a threshold value, $T_h$, which results in a binary image $Z'_b$. In order to automatically determine $T_h$, the distance matrix $Z'_d$ is processed as if it were 2D information by means of Equation 24, where $g$ is replaced by $Z'_d$, resulting in a grey scale image $G'$. Then, the calculation of the smoothed histogram of $G'$ and the search for the threshold value are carried out in a similar way as presented in the 'Jump edge reduction' section. The binarisation process assigns the value 1 (foreground) to the pixels that exceed $T_h$ and the value 0 (background) to the remaining ones. In the resulting binary image, isolated pixels are removed using morphological operations (dilation, hole filling, and erosion). This enhanced binary image is used as a mask over the 3D points of $Z'$: points whose position in the binary image is classified as background (value 0) are assigned the maximum value, whereas points whose position is classified as foreground (value 1) keep their real Z values. In this way, a new ToF distance matrix $Z''$ is obtained. Figure 3 illustrates this method for the background and foreground 3D value assignment and selection.
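The reference-image background subtraction and the foreground value assignment can be sketched as follows. This is a simplified illustration under stated assumptions: the threshold is passed in directly instead of being derived from the smoothed histogram, the morphological clean-up of the binary image is omitted, and the "maximum value" given to background pixels is taken to be the largest distance in the current image unless `z_max` is supplied. The function names are hypothetical.

```python
import numpy as np

def background_reference(frames):
    """Pixel-wise average of t corrected distance images, shape (t, n_x, n_y)."""
    return np.mean(np.asarray(frames, dtype=float), axis=0)

def foreground_distance(z_current, reference, threshold, z_max=None):
    """Return the modified distance matrix and the binary foreground mask.

    Background pixels receive `z_max`; foreground pixels keep their
    measured distance. `z_max` defaults to the largest current distance
    (an assumption made for this sketch).
    """
    z_current = np.asarray(z_current, dtype=float)
    diff = np.abs(reference - z_current)     # element-wise difference image
    mask = diff > threshold                  # True = foreground (value 1)
    if z_max is None:
        z_max = z_current.max()
    z_fg = np.where(mask, z_current, z_max)
    return z_fg, mask
```

Assigning one common value to every background pixel makes it straightforward to discard the background later with a second threshold, since all of those points end up on a single plane.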

Figure 3 Selection of foreground Z values.
Method for the background and foreground 3D assignment and selection using a binary image obtained by using the image reference method.

Reprojection of 3D ToF information into a colour image
With the aim of giving additional colour information to the 3D foreground points previously detected, these points are reprojected into a colour image. Using colour and amplitude images, both cameras are calibrated with respect to the world coordinate frame. Since both cameras can be represented by the pinhole camera model [42,52], a tool such as the Camera Calibration Toolbox for Matlab [53] can be used to extract internal and external parameters for both cameras. External parameters are used to transform 3D ToF information given in the camera coordinate system into the world coordinate system. Internal and external parameters, in turn, are used to reproject 3D information into colour images. Hence, based on camera calibration theory [48,54,55] and after the range camera error reduction, the reprojection process is applied over the corrected and transformed 3D points following the transformations described below.
The transformation of ToF information, after discrepancy and saturation corrections and foreground detection, from world frame coordinates $P_w = [X', Y', Z'']^T$ to camera frame coordinates $P_c = [X_c, Y_c, Z_c]^T$ is a rigid transformation (Equation 30) in which the extrinsic parameters are expressed by the 3 × 3 rotation matrix $R$ and by the 1 × 3 translation vector $T$.
Frequently, standard CCD colour cameras have a higher resolution than range cameras, so the reprojection of 3D points does not have a one-to-one equivalence. Hence, the ToF information is scaled up by bilinear interpolation. In addition to this, as only information of foreground 3D points will be extracted, the automatic thresholding process is applied to the 3D points $P_c$ in order to remove those points classified as background, which results in a new 3D point cloud. Image coordinates are affected by tangential and radial distortions; therefore, the models of these systematic distortions are added to the pinhole model following the method proposed in [55]. The transformation between the three-dimensional coordinate frame and the image coordinate frame without distortion $(x_u, y_u)$ is given by the pinhole perspective projection, where the intrinsic parameter $f$ is the focal length in millimetres.
The relation between the image coordinates with distortion $(x_d, y_d)$ and without distortion $(x_u, y_u)$ is obtained by adding the radial, $D^{(r)}$, and tangential, $D^{(t)}$, distortion terms to the pinhole model. The transformation from distorted image coordinates to pixel coordinates $(u, v)$ depends on the following intrinsic parameters: $k_u$ and $k_v$ are the number of pixels per millimetre (horizontally and vertically, respectively), $s$ is the skew factor, whose value is usually zero, and $(u_0, v_0)$ are the coordinates of the centre of projection.
After obtaining the pixel coordinates (u, v) of the 3D foreground points, these values are converted into pixel positions by rounding them to the nearest integer. Furthermore, as the area captured by the two cameras (ToF and colour) is not exactly the same, pixels in non-common areas are eliminated. A diagram which illustrates the proposed method is shown in Figure 4.
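The chain of transformations described above (rigid transformation, pinhole projection, lens distortion, conversion to pixels, rounding) can be sketched as a single function. The exact equations of the source are not reproduced here, so the following forms are assumptions: the rigid transformation is taken as P_c = R P_w + T, the distortion is a generic radial plus tangential model with hypothetical coefficients `k1`, `k2`, `p1`, `p2`, and the pixel mapping is u = k_u x_d + s y_d + u_0, v = k_v y_d + v_0.

```python
import numpy as np

def reproject(points_w, R, T, f, ku, kv, u0, v0, s=0.0,
              k1=0.0, k2=0.0, p1=0.0, p2=0.0):
    """Map N world points (N x 3) to integer pixel coordinates (N x 2)."""
    P = np.asarray(points_w, dtype=float)
    R = np.asarray(R, dtype=float)
    T = np.asarray(T, dtype=float).reshape(1, 3)
    # Rigid transformation into the colour camera frame (assumed form).
    Pc = P @ R.T + T
    # Pinhole projection; f in millimetres gives image coordinates in mm.
    xu = f * Pc[:, 0] / Pc[:, 2]
    yu = f * Pc[:, 1] / Pc[:, 2]
    # Generic radial + tangential distortion (illustrative coefficients).
    r2 = xu ** 2 + yu ** 2
    radial = k1 * r2 + k2 * r2 ** 2
    xd = xu + xu * radial + 2 * p1 * xu * yu + p2 * (r2 + 2 * xu ** 2)
    yd = yu + yu * radial + p1 * (r2 + 2 * yu ** 2) + 2 * p2 * xu * yu
    # Millimetres to pixels, then rounding to the nearest integer.
    u = ku * xd + s * yd + u0
    v = kv * yd + v0
    return np.rint(np.stack([u, v], axis=1)).astype(int)
```

With all distortion coefficients left at zero the function reduces to the plain pinhole model, which is a convenient way to check calibration values before enabling the distortion terms.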

Figure 4 Proposed method for fusion 3D ToF and colour information.
Stages proposed to achieve the 3D ToF and colour information fusion.

Experimental setting
In this article, a method for the fusion of colour and 3D information that is suitable for active security systems in industrial robotic environments is presented. To verify the proposed methods, a colour camera, AXIS 205, and a range camera, SR4000, have been located over the workspace of the robot arm FANUC ARC MATE 100iBe. The AXIS 205 Network Camera used has a resolution of 640 × 480 pixels and a pixel size of 5.08 × 3.81 mm. The SR4000 range camera has a resolution of 176 × 144 pixels and a pixel size of 40 × 40 µm. This camera has a modulation frequency of 29/20/31 MHz and a detection range from 0.1 to 5 m.

Camera calibration
This initial stage is intended to obtain extrinsic and intrinsic parameters from the standard CCD camera and ToF camera by means of a calibration process using a common reference frame.

Reduction of distance error
To correct for any misalignment between the range camera and the experimental panel employed, the angular deviations in x and y coordinates have been estimated and their effects have been corrected. Figure 5 shows the effect of the angle and displacement corrections in a 3D point cloud. Discrepancy curves which do not take into account pixel position have been calculated before and after angular correction. These curves are shown in Figure 6. It can be seen that the distance error is a function of the measured distance, and the discrepancy values show a small improvement after the angle correction, as expected.

Figure 6 Discrepancy curves.
Original discrepancy values are shown in blue, discrepancy values after angle correction are presented in magenta, and discrepancy values of the set of ToF distance images Z T after discrepancy correction are shown in green. Dots are data obtained from the experiment; lines are splines fitted at these points.

To take into account the effect of pixel position in the distance error, a discrepancy curve has been generated for each pixel. These curves are tested by using the set Z V as input, resulting in a correction value to be applied at each pixel. Figure 6 shows in green colour the discrepancy values after the correction with a cubic spline for each pixel.
The results indicate that the improvement achieved using a discrepancy curve at each pixel is almost imperceptible, which can be explained by the selected area being too small and centred in the middle of the image, so the influence of the pixel position is very low. To check the influence of pixel position on the distance error, a larger area has been selected from a reduced set of images taken from Z V . Figure 7a shows the MSE before these corrections, while the MSE after discrepancy correction using the same discrepancy curve at each pixel is shown in Figure 7b. It can be seen that there is a reduction in the MSE over the selected area. However, as can be observed, the distance error is not just a function of the distance value but also depends on the location of the pixel. The results obtained using a different discrepancy curve at each pixel (Figure 7c) suggest that this kind of correction leads to better results. Hence, a discrepancy correction which takes into account pixel position and distance value is suggested for future work. Since the larger area does not provide complete information along the whole distance range, a discrepancy curve which does not take into account pixel position is used in the experiment. Then, the discrepancy curve calculated after angle correction is tested by using as input a ToF distance image from a real scene in which a human and a robot arm are captured, resulting in a correction value to be applied at each pixel. Figure 8a shows in red the discrepancy values selected for the discrepancy correction, together with the generated cubic spline shown in cyan. In order to verify the effect of applying the distance correction, the initial 3D point cloud and the results after applying the distance correction are shown in Figure 8b.

Value correction of saturated pixels
A real scene in which a human and a robot arm appear is used to illustrate the proposed methods of error corrections and the fusion of 3D and colour information. The value correction of saturated pixels in ToF information captured from this real scene has been carried out. This example of the effect of saturation is illustrated in Figure 9a which shows an amplitude image in which several saturated pixels are located on an area of the robot arm. These high values do not allow the correct visualization of the scene. Figure 9b shows the effect of saturated information over 3D data where saturation produces pixels with zero coordinate values. According to [45], pixel saturation occurs when the amplitude values are greater than 20,000, so this value has been used as threshold in Equation 20. After applying the proposed method to saturated pixels using this threshold value, Figure 10a shows the improvement achieved with this correction, allowing the view of the total scene. Figure 10b shows 3D points in which pixels with zero coordinate values have been corrected.

3D analysis for detecting foreground objects and coordinate frame transformation
To illustrate the detection of foreground objects using 3D ToF information, the background subtraction method based on the reference image model has been used. Figure 11a shows 3D information with a generated reference matrix with Z values. Figure 11b shows three-dimensional information resulting from subtracting the reference distance matrix from the real scene distance matrix, in which positive values indicate possible motion points. In order to take into account only foreground 3D points, an automatic thresholding process and the proposed method for the background and foreground 3D values assignment and selection have been applied using Equation 28. After that, the modified 3D points are shown in red in Figure 12, whereas the initial 3D points are shown in cyan. It can be observed that in the modified 3D points, all background points have equal Z values. However, as the points of interest are the foreground points, the background points are not taken into account; therefore, the final 3D representation of the scene is not affected by those equal Z values. After coordinate frame transformation using Equation 30, and another automatic thresholding process to remove points classified as background, the result achieved in this example is shown in Figure 13, where the detected foreground object is represented in the world coordinate system.

Resolution increase
As the standard CCD camera employed provides a colour image with a higher resolution (480 × 640) than the 3D ToF information (176 × 144) provided by the range camera, the reprojection of 3D points does not have a one-to-one equivalence. Therefore, the ToF matrices have been scaled up using bilinear interpolation and reprojected to the colour image using Equations 30 to 32.
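For illustration, the scaling-up step can be written as a small bilinear interpolation routine in NumPy. This is a sketch, not the implementation used in the experiments: the function name `bilinear_resize` and the corner-aligned sampling grid are assumptions.

```python
import numpy as np

def bilinear_resize(img, out_shape):
    """Resize a 2D matrix to `out_shape` by bilinear interpolation.

    The first and last samples of the output coincide with the first and
    last samples of the input (corner-aligned grid, an assumption).
    """
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    H, W = out_shape
    rows = np.linspace(0, h - 1, H)            # output rows in source coordinates
    cols = np.linspace(0, w - 1, W)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)             # clamp at the last row/column
    c1 = np.minimum(c0 + 1, w - 1)
    fr = (rows - r0)[:, None]                  # fractional offsets
    fc = (cols - c0)[None, :]
    top = img[r0][:, c0] * (1 - fc) + img[r0][:, c1] * fc
    bottom = img[r1][:, c0] * (1 - fc) + img[r1][:, c1] * fc
    return top * (1 - fr) + bottom * fr
```

Each of the X, Y, and Z matrices would be resized in the same way, for example `bilinear_resize(Z, (480, 640))`. Interpolating across a depth discontinuity produces intermediate distance values, which is consistent with applying the jump edge reduction after this step.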

Jump edge reduction
In order to compare some usual edge filters with the morphological filter used for detecting the jump edge effect, ToF information from the scene, after interpolation of the 3D points, has been processed. Figure 14a shows the results achieved using a Sobel filter on the distance values of the 3D information, and Figure 14b shows the results obtained using the morphological filter on the distance values, obtained by using Equation 23 with a 3 × 3 dilation and erosion mask S. It can be observed that the edges found by applying the Sobel filter are not continuous and are also narrower than the edges found by the morphological filter; using the latter, most of the spurious pixels can be detected and removed from the 3D ToF points.

Figure 14 (b) View of the morphological gradient applied to z information. The edges found are continuous and have a thickness of several pixels, which allows jump edge reduction.
In order to smooth the histogram of the grey scale distance image, a first-order lowpass Butterworth filter with a normalized cutoff frequency of 0.5 is used. As an example of the application of the proposed method for jump edge reduction, Figure 15a shows the spurious pixels produced in the object contours by the jump edge effect and Figure 15b shows the 3D points after the reduction of spurious pixels by the proposed method. Although not all spurious pixels have been eliminated, the results show a significant improvement, as most of them have been detected and eliminated.

Figure 15 (a) View of the 3D scene of Figure 9 in which the jump edge effect appears. (b) View of the same 3D scene in which jump edge has been reduced using the morphological gradient operation.
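The histogram smoothing and threshold search can be illustrated with a short NumPy sketch. It relies on one fact and several assumptions. The fact: when the cutoff is normalised to the Nyquist frequency and the filter is designed by the bilinear transform, a first-order Butterworth low-pass with cutoff 0.5 has coefficients b = [0.5, 0.5] and a = [1, 0], so it reduces to averaging each bin with the previous one. The assumptions: zero initial conditions, a 256-bin histogram of the rounded grey values, and the hypothetical function names used below.

```python
import numpy as np

def smooth_histogram(hist):
    """First-order Butterworth low-pass, normalised cutoff 0.5:
    y[n] = 0.5 * x[n] + 0.5 * x[n-1], with x[-1] taken as 0."""
    x = np.asarray(hist, dtype=float)
    y = 0.5 * x
    y[1:] += 0.5 * x[:-1]
    return y

def first_min_after_first_max(h):
    """Index of the first minimum to the right of the first maximum."""
    h = np.asarray(h, dtype=float)
    i = 0
    while i + 1 < len(h) and h[i + 1] >= h[i]:   # climb to the first maximum
        i += 1
    while i + 1 < len(h) and h[i + 1] < h[i]:    # descend to the next minimum
        i += 1
    return i

def jump_edge_threshold(G):
    """Threshold eta for a grey scale image G with values in [0, 255]."""
    levels = np.rint(np.asarray(G)).astype(int).ravel()
    hist = np.bincount(levels, minlength=256)
    return first_min_after_first_max(smooth_histogram(hist))
```

In a gradient image most pixels lie in flat regions, so the first maximum sits near grey level 0 and the first minimum after it separates those pixels from the edge pixels.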

Reprojection of 3D foreground points into colour images
In order to obtain a matrix that contains 3D and 2D information, using the calibration parameters of the cameras, the reprojection of 3D foreground points into a colour image has been carried out. Then, the reprojected points are adjusted into pixel values and those that are in the non-common area of both cameras are removed. A selection mask is generated by using the resulting pixels, and this mask is used to select the coincident coordinates of the colour pixels. This method makes it possible to achieve a colour segmentation based on 3D information and to have 2D and 3D information in a single matrix. Figure 16 shows foreground segmentation in the colour image based on foreground detection of 3D points in the world coordinate system.
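A possible way of building the fused matrix and the selection mask is sketched below. The column layout (X, Y, Z, u, v, R, G, B), the function name `fuse`, and the convention that u indexes columns and v indexes rows of the colour image are assumptions made for this example; the source only states that 2D and 3D information end up in a single matrix.

```python
import numpy as np

def fuse(points_3d, pixels_uv, colour_image):
    """Combine 3D points with the colour found at their reprojected pixels.

    points_3d    : (N, 3) foreground points in the world frame
    pixels_uv    : (N, 2) integer pixel coordinates (u = column, v = row)
    colour_image : (H, W, 3) colour image
    Returns the fused (M, 8) matrix and the (H, W) selection mask.
    """
    pts = np.asarray(points_3d, dtype=float)
    uv = np.asarray(pixels_uv, dtype=int)
    h, w = colour_image.shape[:2]
    # Discard points that fall outside the area common to both cameras.
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < w) &
              (uv[:, 1] >= 0) & (uv[:, 1] < h))
    pts, uv = pts[inside], uv[inside]
    rgb = colour_image[uv[:, 1], uv[:, 0]].astype(float)
    mask = np.zeros((h, w), dtype=bool)
    mask[uv[:, 1], uv[:, 0]] = True              # pixels selected in the colour image
    fused = np.hstack([pts, uv, rgb])            # X Y Z u v R G B
    return fused, mask
```

The mask can be applied directly to the colour image to obtain the colour segmentation of the foreground, while each row of the fused matrix gives the 3D position and colour of one point.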

Discussion
The aim of this work is to achieve the fusion of colour and 3D ToF information in order to apply it in active security tasks for industrial robotic environments. Given the coordinates of a 3D point, this fusion makes both its colour information and its 3D position available at the same time, in a world coordinate system common to both cameras and the robot arm.
After obtaining intrinsic and extrinsic parameters by a calibration process, the proposed method of distance error reduction improves the distance measurement values, and the achieved effect in the scene can be observed after the application of the information fusion method. The correction curves obtained are consistent with curves reported by other authors such as [41], and also with the use of cubic splines to approximate and correct the distance error [42]. This consistency occurs despite some differences in experimental setup, such as a reduced range of measurement, different ToF camera models, target material, and camera configuration parameters.
As a second stage, saturation error correction must be performed, given that in industrial environments certain materials such as metal or reflecting paints are often present and can produce saturated pixels in the range camera information. The results obtained show that this method works well, as it allows the correct visualization of the amplitude image and, more importantly, it corrects the values of saturated pixels in the 3D points. If these points had incorrect values, the reprojection stage would fail at these positions, as the 3D values would be reprojected as 0, and so their 2D information would be lost.
In order to detect foreground objects, the reference image technique applied to 3D data, after error corrections, has been used and presented as a simple and fast method which yields acceptable results. The 3D points of foreground objects are correctly identified and only a few false positives are detected which can be removed easily using 2D image morphological operations. Traditionally, this technique is used in colour and grey scale images, but illumination variations result in false foreground detection.
The advantage of using ToF information is that it has a more stable behaviour under these illumination conditions. In addition, this technique has a short computational time, which is an important factor in developing a suitable strategy for active security in robotic industrial environments. Then, using extrinsic parameters, the transformation of foreground 3D points from the camera to the world reference frame is carried out, and the points are scaled up by bilinear interpolation. The proposed method of jump edge reduction, applied to the resulting distance points, minimises the false positives and false negatives around an object edge which arise in the pixel reprojection process as a consequence of spurious pixels that do not have correct 3D values. The achieved results can be considered acceptable since most spurious pixels are removed without changing the object shape, and therefore, a smoother 3D point reprojection over object edges in colour images is achieved.
Finally, the reprojection of the resulting 3D points to the colour image is performed. Nevertheless, as can be seen in Figure 16, this reprojection is not perfect, since in spite of having applied distance error reduction, the position of the pixels in the image has not been taken into account, and a single correction value is applied which is a function of the measurement distance but not of the pixel position in the image.

Conclusions
This paper aims to contribute to the research area of active security systems in industrial robotic environments using ToF cameras.
Despite the fact that active security in robotic industrial environments is a well-studied topic, few previously published methods have dealt with this subject using the combination of ToF cameras and colour cameras. The paper describes the development of methods for the fusion of colour and 3D ToF information as an initial step in the design of a system for collision prevention between a human and a manipulator robot sharing a workspace at the same time. Furthermore, this work provides a detailed mathematical description of the steps involved in the proposed method, so that any researcher can implement it.
The presented method has a different standpoint from the methods previously proposed in the literature, since a common coordinate system is defined for a robot arm, colour camera and ToF camera. The obtained calibration parameters are used to transform the 3D points from the ToF camera coordinate system into the defined common coordinate system, which are reprojected in 2D colour images. This procedure has the advantage that it gives a single matrix made of colour and three-dimensional information; therefore, 3D coordinates of objects inside the robot arm's workspace are known at the same time as their colour information. In addition to this, the proposed method for jump edge error detection, which is based on morphological gradient, allows the detection and reduction of jump edge error at points which are affected by this error. Also, in order to obtain a suitable fusion of information, a method for detection and reduction of saturated pixels, which is based on neighbour pixels information, has been proposed.
As future work, in order to improve the accuracy of fused information, a modification of the applied distance correction method is suggested. A preliminary study carried out with a small range of distances shows the influence of the pixel position in the distance measurements. Hence, a suggestion for future work is to modify the error correction so that it takes into account the position of the 3D point (measured distance and pixel location).
A possible application to prevent collisions between an industrial robot and a human would be to use colour information to characterise the detected foreground objects and to associate a security volume around each object.