Interactions Between Specific Human and Omnidirectional Mobile Robot Using Deep Learning Approach: SSD-FN-KCF

To fulfill the tasks of human-robot interaction (HRI), detecting the specific human (SH) is paramount. In this paper, a deep learning approach integrating Single-Shot Detection, FaceNet, and Kernelized Correlation Filter (SSD-FN-KCF) is developed. At the outset, the SSD is employed to detect a human up to 8 m away using an RGB-D camera with 320 × 240 resolution. Afterward, the omnidirectional mobile robot (ODMR) is driven to within 2.5∼3.0 m of the detected human such that the depth image can accurately estimate the detected human's pose. Subsequently, the ODMR is commanded to the vicinity of 1.0 m and an orientation within −60∼60° with respect to the optical axis so that the FaceNet can identify whether he/she is the SH. To reduce the computation time of the FaceNet and extend the SH's tracking, the KCF is employed to achieve the task of HRI (e.g., human following). Based on the image processing result, the required pose for searching or tracking the (specific) human is accomplished by the image-based adaptive finite-time hierarchical constraint control. Finally, an experiment in which the SH is far from and behind the ODMR validates the effectiveness and robustness of the proposed approach.


I. INTRODUCTION
Human-robot interaction (or collaboration) has received increasing attention in the last decades, since robots may act as both helpers and companions for elderly and impaired people, especially in an aging population [1], [2]. With the great progress in robotics and the rapid evolution of computing systems, many advanced social, service, and surveillance mobile robots have been or are being developed around the world. One of the key functionalities of these advanced robots is the ability to detect a specific human in real time [3]-[8]. A service robot needs to be aware of the humans around it and track a target person to provide services. A social robot should be able to pay attention to the persons in view and keep tracking the persons engaged in the interaction. A surveillance robot may be required to monitor persons in the scene and approach a suspected person for a close observation of his/her appearance and behavior. For these tasks, the most challenging scenarios are detecting and tracking multiple persons in frequently crowded and cluttered scenes in public environments. Real-time human detection and tracking has become one of the research focuses for service, social, and surveillance robots in the literature due to its necessity for human-robot interaction.
The associate editor coordinating the review of this manuscript and approving it for publication was Shadi Aljawarneh.
Recently, at least two approaches to object detection have been developed [9]-[12]. The first approach, connected with deep learning, is a two-stage process. The first stage, the so-called "Selective Search," finds the candidate regions for the object, and a convolutional neural network (CNN) is employed to extract their features for prediction. In the second stage, the corresponding features for the different candidates are classified by a support vector machine (SVM). Although the first approach is accurate, it needs much computation time. In contrast, the second approach merges the two stages of the first approach into a single stage such that the computation time is much reduced; the accuracy is slightly reduced but acceptable. The single shot detector (SSD) belongs to the second approach [11]-[16]. In this paper, the SSD is first employed to detect the human(s), who is (are) not necessarily static. If no human is detected, the ODMR executes the search for a human by the image-based adaptive finite-time hierarchical constraint control (IB-AFTHCC) [17], [18]. After the detection of a human, the ODMR approaches the detected human to an appropriate distance (e.g., 2.5∼3 m). Afterward, the pose between them is estimated by a depth image such that an accurate approach to a specific pose (e.g., in the vicinity of 1 m and 0° with respect to the optical axis) for face recognition is accomplished.
To judge whether the detected human is the specific human (SH), his/her face is recognized by the FaceNet [19], which is one kind of face recognition method [20]-[23]. FaceNet directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Only 128 bytes per face are required to achieve a robust recognition rate of over 95%. For one SH, the calculation of the FaceNet can be reduced such that the on-line searching of the SH using the IB-AFTHCC is feasible. After the identification of the SH, he/she is tracked by the kernelized correlation filter (KCF) to avoid the repeated calculation of the FaceNet, extend the tracking distance, and reduce the computation time. The KCF utilizes the properties of a circulant matrix and a kernel to achieve fast target tracking, and it can deal with occlusion and scale changes in various scenes [24]-[28]. Afterward, the interactions between the SH and the ODMR (e.g., human following control) are implemented by the IB-AFTHCC. In contrast, if the face of the detected human is difficult to recognize, a searching strategy to obtain a suitable pose for face recognition is executed. If the SH is not detected, the ODMR moves forward a distance (e.g., 2.5 m) and repeats the above searching procedure. To accelerate the processing, the GPU is combined with the CPU such that the on-line pose planning and control [29]-[32] and the human-robot interactions [33]-[36] are more practical.
The contributions of this study are summarized as follows: (i) The learned SSD can effectively detect a human beyond the generally recognized distance of an RGB-D camera system (e.g., 8 m). (ii) The FaceNet is effectively learned to recognize different faces with a recognition rate over 95% under a suitable distance (0.75∼1.25 m), different view angles (−60∼60°), different light angles (−80∼80°), and some occlusions. (iii) To avoid the repeated calculation of the FaceNet, extend the tracking distance, and reduce the computation time, the KCF is employed to track the SH such that human-robot interactions (e.g., human following) are achieved by the suggested IB-AFTHCC.
The outline of this study is as follows. In Section II, related work is given and discussed. In Section III, the experimental setup and task description are described. In Section IV, the deep learning approach integrating SSD, FaceNet, and KCF is developed. In Section V, the image-based adaptive finite-time hierarchical constraint control is employed to accomplish the required task of the human-robot interaction. In Section VI, the corresponding experiments are presented to validate the effectiveness and robustness of the proposed method. Finally, the conclusions are given in Section VII.

II. RELATED WORK
At the outset, some representative papers about human detection using the SSD are discussed. In [11], local similarity (encoded by local descriptors) is combined with a global context (i.e., a graph structure) of pairwise affinities among the local descriptors, and the query descriptors are embedded into a low-dimensional but discriminatory subspace. The power of the Fourier transform combined with the integral image achieves superior runtime efficiency and allows multiple hypotheses to be tested within a reasonably short time; it is a training-free algorithm. The algorithm in [12] includes two different components that are trained "in one shot" at the first video frame: a detector that makes use of the generalized Hough transform with color and gradient descriptors, and a probabilistic segmentation method based on global models for foreground and background color distributions. In [13], a framework integrating support vector machine based trail detection with a trail tracker is proposed to accomplish trail direction estimation and tracking at a low computational cost and in real time. In [14], a fine-CNN with a nine-layer neural network is designed for detailed pedestrian recognition. A pedestrian in a surveillance video is segmented and finely recognized by the improved single-shot detector and several fine-CNNs, supported by the parallel mechanisms of the Apache Storm stream processing framework. In [15], an end-to-end trainable fast scene text detector called TextBoxes++ is developed; it detects arbitrarily oriented scene text with both high accuracy and efficiency in a single network forward pass, without post-processing other than efficient non-maximum suppression. In [16], a deep neural network with RGB-D image input predicts multiple grasp candidates for a single object or multiple objects in a single shot, with a real-time processing time of less than 0.25 s. Some representative papers for face (emotion) recognition are addressed as follows.
In [19], FaceNet directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Only 128 bytes per face are required to achieve a robust performance of over 95%. In [20], a 3D face recognition method based on the fusion of shape and texture local binary patterns on a mesh is presented. It exploits the fact that the mesh surface, which preserves the full geometry, does not require normalization, can accommodate partial matching, and allows the early-level fusion of texture and shape modalities. In [21], CNN-based face recognition using the ORL and Yale databases with gray-scale images demonstrates performance similar to that of [23]. Nevertheless, its recognition rates are sensitive to the amount of occlusion or salt-and-pepper noise. In [22], an enhanced face recognition method is proposed by utilizing local binary pattern histogram descriptors, Multi-K-nearest-neighbor, and a Back Propagation Neural Network. Since the correlation method utilized requires substantial computation time and large storage, feature reduction and face representation are required. In [23], a local binary pattern histogram for face recognition from RGB images together with a suitable feature dimension of Depth images, which have a wide range of variations in head pose, illumination, facial expression, and occlusion in some cases, is developed to extract the facial features and then improve the recognition rate. In [36], a two-layer fuzzy support vector regression-Takagi-Sugeno model is proposed for emotion understanding in human-robot interaction, e.g., the scenario of drinking in different emotions. However, its maximum average video-based recognition rate for different genders, provinces, and ages is 77.62%, which leaves room for improvement.
Finally, the representative papers about the tracking of the SH using the kernelized correlation filter (KCF) are addressed. In [24], both the KCF and the dual correlation filter outperform top-ranking trackers such as structured output tracking with kernels or tracking-learning-detection on a 50-video benchmark, despite running at hundreds of frames per second and being implemented in a few lines of code. It indicates that the KCF is indeed a fast and effective tracker. A real-time RGB-D object tracker based on the KCF, which utilizes the properties of a circulant matrix and a kernel to achieve fast target tracking, is proposed to deal with occlusion and scale changes in various scenes [25]. In [26], a tracking algorithm that reduces the feature dimensionality and interpolates the correlation score is employed to reduce the computational cost for fast tracking. Occlusion and fast motion problems can be effectively solved by the expansion of the search area at a speed of 69.5 frames per second, which is suitable for real-time application. By integrating an adaptive obstacle detection strategy within a KCF framework, a fast and robust approach for obstacle detection and tracking is developed [27]. In [28], an online learning method that is based on the KCF and assembles different feature channels into kernelized experts is employed to track vehicles at night. By estimating their reliabilities, the appearance model focuses on the most discriminative visual features to achieve the classification.
In summary, a deep learning approach integrating the advantages of SSD, FaceNet, and KCF is proposed to deal with the corresponding human-robot interactions [33]-[36].

III. EXPERIMENTAL SETUP AND TASK DESCRIPTION

A. EXPERIMENTAL SETUP
The experimental setup of the proposed omnidirectional mobile robot (ODMR) in Fig. 1(a) includes the following four parts: (i) three dc servomotors, (ii) one motion control platform, (iii) a laptop for image processing, and (iv) an RGB-D camera system. The three dc servomotors are model no. 578296 from Maxon Co. with a gear ratio of 66:1; in addition, the gear ratio between the motor and the wheel is 1.2:1. The driver is model no. ESCON-422969 with the following important specifications: (i) power: 700 W, (ii) input voltage: 10∼70 V, (iii) peak current: 30 A, (iv) continuous current: 10 A, (v) weight: 204 g, (vi) dimension: 115 × 75.5 × 24 mm. (iv) 4GB DDR3 SODIMM. The suggested adaptive finite-time hierarchical constraint control (AFTHCC) algorithm is computed on the Intel® Atom N2600; the PWM for driving the motors and the decoder for obtaining the motor velocity are executed in the FPGA (cf. Fig. 1(b)). On the other hand, the laptop is an ASUS computer with the following important specifications: (i) Intel Core i7 CPU with six cores and 12 threads, (ii) GTX 960M GPU with 640 CUDA cores.
The proposed RGB-D vision system is the ASUS Xtion PRO, which provides not only depth data but also RGB colour images and audio (using a microphone array). It is installed at a height of 145 cm and has the following important specifications: (i) Dimensions: 177.8 × 48.2 × 38.1 mm, (ii) Weight: 540 g, (iii) USB connection: consumption < 2.5 W, (iv) Detection range: 0.8∼3.5 m,

B. TASK DESCRIPTION
The fundamental tool of deep learning is the Convolutional Neural Network (CNN), which is made up of Convolution, Pooling, and Full Connection layers. Their sizes and numbers depend on the practical application. A Convolution layer can have different length, width, and height such that the dimension, complexity, and characteristics of the input image are captured by convolutions of different dimensions followed by a nonlinear transform. To avoid losing features of the input image, the Convolution layer keeps the same resolution as the input image. In contrast, the Pooling layer is employed to reduce the corresponding resolution. A suitable dimension can also increase the robustness [37]. Finally, the multi-dimensional features are transformed into a one-dimensional feature by the Full Connection layer.
In this study, the SSD is first employed to detect the human, who is not necessarily static. If no human is detected, the robot executes the search for a human. After the detection of a human, the ODMR approaches the detected human to an appropriate distance (e.g., 2.5∼3 m). Then a depth image is used to estimate the pose between them such that an accurate approach to the detected human with a specific pose (e.g., 0.75∼1.25 m and −60∼60°) is achieved. To judge whether the detected human is the specific human (SH), his/her face is recognized by the FaceNet [19]. If the SH is identified, the SH is tracked by the KCF [24]-[28] to avoid the repeated calculation of the FaceNet. Afterward, the corresponding interactions between the SH and the ODMR (e.g., human following control) are implemented by the image-based adaptive finite-time hierarchical constraint control (IB-AFTHCC) [17], [18]. If the face of the detected human is difficult to recognize, the ODMR is commanded to a suitable pose to execute the FaceNet algorithm. If the detected human is not the SH, the ODMR continues the search (e.g., moving forward a distance of 2.5 m and repeating the above procedure) until the SH is detected. The corresponding flowchart is depicted in Fig. 2. To accelerate the processing, the GPU is combined with the CPU to achieve a processing time of about 50 ms such that the on-line pose planning and control [29]-[32] and the human-robot interactions [33]-[36] are more practical. Moreover, the parameters of Fig. 2 are explained as follows: (i) S: S = 1, 2, 3 respectively determines the initiation of the SSD, FaceNet, and KCF functions, (ii) F: the index to determine whether the human following is executed, (iii) C_1: the counter to determine which region is searched, (iv) C_2: the counter to determine whether the face searching is required, (v) d_v, ψ: the vertical distance and orientation between the human and the ODMR, (vi) d_f: the distance index to determine whether the task of human following is implemented.
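The switching among the three functions selected by S can be sketched as a small state machine. This is only an illustrative sketch, not the flowchart of Fig. 2 itself: the `Observation` fields, the `in_face_region` helper, and the rule that a lost KCF track returns to the SSD search are assumptions introduced here; only the values S = 1, 2, 3 and the pose window 0.75∼1.25 m, −60∼60° come from the text.

```python
from dataclasses import dataclass

SSD, FACENET, KCF = 1, 2, 3  # values of S described for Fig. 2


@dataclass
class Observation:
    """Hypothetical per-frame summary of the image processing result."""
    human_detected: bool      # SSD found a person in the frame
    d_v: float                # distance to the person in metres
    psi_deg: float            # orientation w.r.t. the optical axis in degrees
    is_specific_human: bool   # FaceNet verdict (meaningful only when S == FACENET)
    track_ok: bool            # KCF still holds the target (meaningful only when S == KCF)


def in_face_region(d_v: float, psi_deg: float) -> bool:
    """Pose window quoted in the text for reliable face recognition."""
    return 0.75 <= d_v <= 1.25 and abs(psi_deg) <= 60.0


def next_state(S: int, obs: Observation) -> int:
    """Return the next value of S given the current one and an observation."""
    if S == SSD:
        # Stay in detection until a human stands inside the face-recognition window.
        if obs.human_detected and in_face_region(obs.d_v, obs.psi_deg):
            return FACENET
        return SSD
    if S == FACENET:
        # A recognized SH is handed to the tracker; otherwise the search resumes.
        return KCF if obs.is_specific_human else SSD
    if S == KCF:
        # Assumption: a lost track falls back to the SSD search.
        return KCF if obs.track_ok else SSD
    raise ValueError(f"unknown state S={S}")
```

In such a sketch the controller would call `next_state` once per processed frame and use the returned S to decide which function to run next.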

IV. DEEP LEARNING APPROACH: SSD-FN-KCF
The human detection, the recognition of the specific human, and the tracking of the specific human are tackled by the proposed deep learning technique, which integrates Single-Shot Detection (SSD), FaceNet (FN), and Kernelized Correlation Filter (KCF).

A. HUMAN DETECTION USING SSD
Recently, at least two deep learning approaches to object detection have been developed. The first approach is a two-stage process. The first stage, the so-called "Selective Search," finds the candidate regions for the object, and a convolutional neural network (CNN) is employed to extract their features for prediction. In the second stage, the corresponding features for the different candidates are classified by a support vector machine (SVM). The second approach merges the two stages into one; the single shot detector (SSD) employed in this study belongs to the second approach [11]-[16]. The corresponding human detection using the SSD is depicted in Fig. 3. Its red rectangle is the VGG16 CNN of the blue rectangle, in which the original FC6 and FC7 are replaced by Conv6 and Conv7, all fully connected layers are removed, and extra convolutional layers are added. The proposed approach uses the trained CNN model to make a prediction for each default box. The predicted probability of each class is called the "score." Afterward, non-maximum suppression (NMS) is employed to select the best prediction. Together with the bounding box and score, the corresponding feature vector is extracted. Subsequently, the learning process of the SSD, including Preprocessing (i.e., steps 1 and 2) and Training (i.e., steps 3∼5), is introduced as follows (cf. Fig. 4).

1) INPUT DATASET
Prepare the corresponding images with the same resolutions including human and nonhuman. VOLUME 8, 2020

2) LABELLING
The dataset is first classified into human and nonhuman. Then the coordinates of the ground truth boxes are labelled for later training.

3) PREDICTION
Since the detected humans have different sizes caused by different persons or different distances, feature maps of different heights and widths are used in the CNN. Hence, the default boxes in Fig. 5, defined with respect to the feature map center, have different heights and widths, which are determined by the scale and aspect ratio. At the outset, the scale is expressed as follows:

$$S_k = S_{\min} + \frac{S_{\max} - S_{\min}}{m - 1}\,(k - 1), \quad k = 1, 2, \ldots, m$$

where $S_k$ is the scale of the $k$-th layer in the feature map, $S_{\max}$ denotes the scale of the highest layer (e.g., $S_{\max} = 0.95$), $S_{\min}$ denotes the scale of the lowest layer (e.g., $S_{\min} = 0.2$), and $m$ is the number of feature maps, e.g., $m = 6$. On the other hand, the aspect ratios are assumed to be $a_r \in \{1, 2, 3, 1/2, 1/3\}$. Their relations with height and width are as follows (cf. Fig. 5):

$$w_k^a = S_k \sqrt{a_r}, \qquad h_k^a = \frac{S_k}{\sqrt{a_r}}$$

where $w_k^a$ and $h_k^a$ are the width and height of the default box in the $k$-th layer. Moreover, an additional scale of the default box for $a_r = 1$ is described as follows:

$$S'_k = \sqrt{S_k S_{k+1}}$$

It is assumed that each feature map cell contains $n_d$ default boxes (default: $n_d = 6$), each box possesses 4 offsets, the number of classes is $P$ (e.g., $P = 1$), and the dimension of the feature map is $k \times l$ (here $k$ denotes the feature map height, not the layer index). Then the total number of outputs for this feature map is $(P + 4) \times n_d \times k \times l$.
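The scale and aspect-ratio rules of this step can be checked numerically with a short sketch. The function names are hypothetical, and the value used for the scale after the last layer (1.0) is an assumption, since the text does not state how the extra box of the highest layer is sized.

```python
import math


def ssd_scales(s_min=0.2, s_max=0.95, m=6):
    """Linearly spaced scales S_1..S_m between s_min and s_max."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def default_box_shapes(k, scales, aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """Return (width, height) pairs of the default boxes of layer k (1-based).

    Five boxes come from the aspect ratios; a sixth box with aspect ratio 1
    uses the geometric mean of the current and the next scale.
    """
    s_k = scales[k - 1]
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    # Assumption: the scale following the last layer is taken as 1.0.
    s_next = scales[k] if k < len(scales) else 1.0
    s_prime = math.sqrt(s_k * s_next)
    shapes.append((s_prime, s_prime))
    return shapes


if __name__ == "__main__":
    scales = ssd_scales()
    for k in range(1, len(scales) + 1):
        print(k, [(round(w, 3), round(h, 3)) for w, h in default_box_shapes(k, scales)])
```

With the example values of the text, this gives six scales from 0.2 to 0.95 in steps of 0.15 and six default box shapes per cell, which matches $n_d = 6$.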

4) HARD NEGATIVE MINING
The Jaccard overlap is used to estimate the similarity between the ground truth box $B_g$ and the default box $B_d$:

$$J(B_g, B_d) = \frac{|B_g \cap B_d|}{|B_g \cup B_d|}$$

If $\max J(B_g, B_d) > 0.5$, then the default box is a positive box; otherwise, it is a negative box. To maintain the stability of the training and the loss value, hard negative mining only chooses the negative boxes with the higher scores (i.e., far away from 0.5) such that the ratio between the positive boxes and the negative boxes is 1:3 (cf. Fig. 7(e) and 7(f)).
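The matching rule and the 1:3 selection can be illustrated by the following sketch. The corner-based box format and all function names are assumptions made for illustration; ranking the negatives by a per-box confidence loss follows common SSD practice and is likewise an assumption, since the text only states that the higher-score negatives are kept.

```python
def jaccard(box_a, box_b):
    """Jaccard overlap of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def split_positive_negative(gt_boxes, default_boxes, threshold=0.5):
    """Indices of default boxes whose best overlap exceeds the threshold, and the rest."""
    positives, negatives = [], []
    for idx, d in enumerate(default_boxes):
        best = max(jaccard(g, d) for g in gt_boxes)
        (positives if best > threshold else negatives).append(idx)
    return positives, negatives


def hard_negative_mining(negatives, neg_loss, num_pos, ratio=3):
    """Keep at most ratio * num_pos negatives, hardest (largest loss) first."""
    ranked = sorted(negatives, key=lambda i: neg_loss[i], reverse=True)
    return ranked[: ratio * num_pos]
```

For example, with one ground truth box and one positive default box, at most three negatives are retained for the loss computation.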

5) LOSS FUNCTION
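The objective loss function is defined by the weighted combination of the classification loss (subscript cla) and the localization loss (subscript loc):

$$L(x, c, l, g) = \frac{1}{N}\left[ L_{cla}(x, c) + \alpha L_{loc}(x, l, g) \right]$$

where $N$ is the total number of matches between the ground truth boxes and the default boxes, in general $\alpha = 1$, and the classification loss is calculated by the softmax loss [11], whose purpose is to make the confidences of the positive and negative samples robust enough to recognize the ground truth box of each class. Here $x$ is the matching indicator between a default box and a ground truth box, $c$ denotes the confidence for the detection, $l$ is the predicted box, and $g$ is the ground truth box. The classification loss is defined as follows:

$$L_{cla}(x, c) = -\sum_{i \in Pos} x_{ij}^{p} \log \hat{c}_i^{p} - \sum_{i \in Neg} \log \hat{c}_i^{0}, \qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$$

Before introducing the localization loss, note that a box in image processing is described by 4 parameters: the center coordinates $(cx, cy)$, the width $w$, and the height $h$. Similarly, the localization loss is defined as follows:

$$L_{loc}(x, l, g) = \sum_{i \in Pos} \; \sum_{u \in \{cx, cy, w, h\}} x_{ij}^{p} \, \mathrm{smooth}_{L1}\!\left( l_i^{u} - \hat{g}_j^{u} \right)$$

where

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \quad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \quad \hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \quad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}$$

and $d_i$ denotes the $i$-th default box. The purpose of the computed loss function in (10) is to reduce the differences between the prediction box and the default box and between the default box and the ground truth box, such that the iteratively predicted difference between the prediction box and the ground truth box becomes smaller. After a steady-state matching error is achieved, the predicted bounding box for the target human is satisfactorily obtained. At the outset, the resolution of the RGB image is set as 320 × 240, and the total number of training images is 512. Some representative positive and negative samples for human detection are shown in Fig. 6. The corresponding loss function responses are presented in Fig. 7(a), (b). After 23,000 training steps, the total loss function converges to the neighborhood of 1.8 (Fig. 7(c)). The numbers of positive and negative samples are shown in Fig. 7(d), (e), in which the ratio is about 1:3. Since each frame is different, the corresponding responses in Fig. 7 fluctuate. Nevertheless, the averaged responses are also presented as the solid lines.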
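A weighted classification-plus-localization loss of this kind can be written out in a few lines of NumPy. The sketch below follows the commonly used SSD formulation (softmax cross-entropy plus smooth L1 on encoded box offsets); the function names, the (cx, cy, w, h) box format, and the label convention (0 = background, 1 = human) are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np


def smooth_l1(x):
    """Smooth L1: 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise (element-wise)."""
    x = np.asarray(x, dtype=float)
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)


def encode_box(g, d):
    """Offsets of ground truth box g relative to default box d, both (cx, cy, w, h)."""
    g = np.asarray(g, dtype=float)
    d = np.asarray(d, dtype=float)
    return np.array([(g[0] - d[0]) / d[2], (g[1] - d[1]) / d[3],
                     np.log(g[2] / d[2]), np.log(g[3] / d[3])])


def softmax_cross_entropy(logits, label):
    """Negative log-softmax of the true class, computed in a numerically stable way."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    return float(np.log(np.exp(z).sum()) - z[label])


def ssd_loss(pred_offsets, logits, labels, gt_boxes, default_boxes, alpha=1.0):
    """(L_cla + alpha * L_loc) / N over matched positives and mined negatives.

    labels[i] is 0 for a mined negative and 1 for a positive; gt_boxes[i] is the
    matched ground truth box of a positive (ignored for negatives).
    """
    n_pos = sum(1 for y in labels if y > 0)
    if n_pos == 0:
        return 0.0  # no match: the loss is set to zero
    l_cla = sum(softmax_cross_entropy(z, y) for z, y in zip(logits, labels))
    l_loc = 0.0
    for l, y, g, d in zip(pred_offsets, labels, gt_boxes, default_boxes):
        if y > 0:
            l_loc += float(smooth_l1(np.asarray(l, dtype=float) - encode_box(g, d)).sum())
    return (l_cla + alpha * l_loc) / n_pos
```

The localization term is accumulated only over positive boxes, while the classification term also covers the negatives kept by hard negative mining.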

6) PERFORMANCE EVALUATION
To validate the effectiveness and robustness of the SSD, a (walking) human at different distances and view angles is investigated. The results are shown in Tables 1-3.
In summary, irrespective of the distance (< 8 m) and view angle (≤ 180°) between the human and the camera, a human detection rate above 98.6% and a frame rate above 30 fps are achieved at the resolution of 320 × 240. A video of the performance evaluation is available at https://youtu.be/gUJd0ATnlXI.
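The non-maximum suppression used to select the best prediction among overlapping boxes can be sketched as a greedy procedure. The corner-based box format, the function names, and the overlap threshold of 0.45 are assumptions chosen for illustration; the text does not specify the threshold.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """Overlap of one box with an array of boxes, all (x_min, y_min, x_max, y_max)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0.0, None) * np.clip(y2 - y1, 0.0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(np.asarray(scores, dtype=float))[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep
```

Two heavily overlapping detections of the same person are thereby reduced to the one with the higher score, while a distant second person is kept.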

B. SPECIFIC HUMAN DETECTION USING FACENET
Previous studies on face recognition (e.g., LBP with SVM [23]) need to build a feature descriptor with a suitable dimension and then train a (multiclass) support vector machine to obtain a model for classification. In contrast, FaceNet directly learns a mapping from face images to a Euclidean space. The similarity of face patterns is expressed as the Euclidean distance in the Embedding layer with 128 dimensions [19]. Once the model is learned, a Euclidean distance below a specific threshold is used as the classification criterion. The flowchart (or procedure) of FaceNet includes Batch (input frame), Deep Architecture, L2 Normalization, Embedding, and Triplet Loss. Since a classifier is not required, the training procedure is faster and more effective. Finally, the comparison between the trained model and the on-line calculated "Embedding values" for an image at a specific time interval determines whether the detected human is the SH or not.

1) DEEP ARCHITECTURE
The CNN of FaceNet is chosen from Inception ResNet-v2 of Google. The details are described as follows [19]: (i) the first network, NN1, is made up of ZF-Net with 22 CNN layers and extra CNN layers, (ii) the second network, NN2, is made up of many Inception modules, (iii) NN3, NN4, NNS1, and NNS2 are employed to reduce the resolution or scale.

2) NORMALIZATION
To map the image $x \in \mathbb{R}^n$ onto a hypersphere, the L2 (or Euclidean) normalization is considered:

$$\bar{x} = \frac{x}{\|x\|_2} \in \mathbb{R}^n$$

In general, the feature vector of a practical image is discontinuous or discrete. Hence, one-hot encoding (i.e., one feature uses one code) can transform this discrete feature into another feature in Euclidean space. The advantages of one-hot encoding are less computation and strong feature description.

3) EMBEDDING
Despite the advantages of one-hot encoding, the storage increases greatly as the number of features increases. To overcome this drawback, the Embedding layer in the FaceNet is reduced to a 128-dimensional byte vector, which satisfies the requirement of face recognition.

4) TRIPLET LOSS
The loss function of the FaceNet is the triplet loss [19]. There are three features in the modeling: (i) the desired feature is denoted as "Anchor," (ii) a feature slightly deviating from Anchor is denoted as "Positive," and (iii) a feature strongly deviating from Anchor is denoted as "Negative." The purpose of the triplet loss is to minimize the Euclidean distance between Anchor and Positive, and simultaneously maximize the Euclidean distance between Anchor and Negative (cf. Fig. 8). It is assumed that Anchor and all Positives are similar, and that Anchor and all Negatives are dissimilar. In brief, the following loss function is defined to be minimized:

$$L = \sum_{i} \max\left( L_{Tr,i},\, 0 \right), \qquad L_{Tr,i} = \left\| f(x_i^{a}) - f(x_i^{p}) \right\|_2^2 - \left\| f(x_i^{a}) - f(x_i^{n}) \right\|_2^2 + \beta$$

where $\beta > 0$ is the margin, $f(\cdot)$ denotes the embedding, and $x_i^{a}$, $x_i^{p}$, $x_i^{n}$ are the Anchor, Positive, and Negative of the $i$-th triplet. Based on the distances among Anchor, Positive, and Negative, triplets are categorized as (i) Easy triplets with $L_{Tr} < 0$, (ii) Hard triplets with $L_{Tr} > \beta$, (iii) Semi-hard triplets with $0 \le L_{Tr} \le \beta$. If Negative belongs to Easy triplets, the triplet loss equals zero. If Negative belongs to Hard triplets, the similarity between Anchor and Negative is so large that they are difficult to distinguish. In contrast, if Negative belongs to Semi-hard triplets, these datasets are learned to maximize the face recognition rate.
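The three triplet categories can be made concrete with a small numerical sketch. The function names are hypothetical, the embeddings are toy 2-D vectors rather than 128-dimensional FaceNet outputs, and the margin term is assumed to be the squared Anchor-Positive distance minus the squared Anchor-Negative distance plus β, as in the usual triplet loss; β = 0.6 is the value quoted later in the text.

```python
import numpy as np


def triplet_margin_term(anchor, positive, negative, beta=0.6):
    """L_Tr = ||a - p||^2 - ||a - n||^2 + beta for one triplet of embeddings."""
    a, p, n = (np.asarray(v, dtype=float) for v in (anchor, positive, negative))
    return float(np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + beta)


def triplet_loss(anchor, positive, negative, beta=0.6):
    """Hinged triplet loss: zero for easy triplets, L_Tr otherwise."""
    return max(triplet_margin_term(anchor, positive, negative, beta), 0.0)


def categorize(term, beta=0.6):
    """Map a margin term to the easy / semi-hard / hard categories of the text."""
    if term < 0.0:
        return "easy"
    if term > beta:
        return "hard"
    return "semi-hard"


if __name__ == "__main__":
    a = (1.0, 0.0)
    for p, n in [((1.0, 0.0), (-1.0, 0.0)),   # negative far away
                 ((0.8, 0.6), (0.6, 0.8)),    # negative slightly farther than positive
                 ((0.0, 1.0), (1.0, 0.0))]:   # negative closer than positive
        t = triplet_margin_term(a, p, n)
        print(categorize(t), triplet_loss(a, p, n))
```

Only the semi-hard and hard cases produce a non-zero loss, which is why easy triplets contribute nothing to the training.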

5) COMPARISON
After the effective training model of the FaceNet, the output of Embedding layer for the real-time image will be compared with that of the trained model. If the Euclidean distance between them is smaller than a specific threshold, then the recognition of the specific human (SH) is achieved. Otherwise, it is not the SH. In summary, the end-to-end training both simplifies the setup and shows that directly optimizing a loss relevant to the task at hand improves performance [19].
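The comparison step amounts to a nearest-distance test between L2-normalized embeddings. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the reference embeddings stand in for the output of the trained model on images of the SH, and the threshold value of 1.0 is a placeholder, since the text does not give the threshold used.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project an embedding onto the unit hypersphere."""
    x = np.asarray(x, dtype=float)
    return x / max(float(np.linalg.norm(x)), eps)


def is_specific_human(live_embedding, reference_embeddings, threshold=1.0):
    """Return (decision, distance) for the closest reference embedding.

    The decision is True when the smallest Euclidean distance between the
    normalized live embedding and the normalized references is below the threshold.
    """
    live = l2_normalize(live_embedding)
    distances = [float(np.linalg.norm(live - l2_normalize(r))) for r in reference_embeddings]
    best = min(distances)
    return best < threshold, best
```

Because both vectors lie on the unit hypersphere, the distance ranges from 0 (identical direction) to 2 (opposite direction), which makes a single fixed threshold meaningful.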

6) PERFORMANCE EVALUATION
Two important factors affect the face recognition rate (FRR): one is the parameter setting of the FaceNet (e.g., β, the resolution of the camera, the number of training samples); the other is the environmental change (e.g., the distance or view angle between the camera and the human, the occlusion of the human, the lighting condition). The SH in this paper is shown in Fig. 9(a). On the other hand, a non-SH is presented in Fig. 9(b). At the outset, β = 0.6, the resolution 320 × 240, 300 training samples, and a camera height of 145 cm are assigned. The face recognition rates (FRR) for different distances, view angles at 1 m and 1.5 m, half faces, and lighting angles are investigated and listed in Tables 4, 5, 6, 7, and 8, respectively.
Based on the results of Tables 4-8 and the FOV of the camera with horizontal 58° and vertical 45° view angles, the important observations are concluded as follows: (i) The red-cross region in Fig. 10 (i.e., between 0.75 m and 1.25 m with an orientation less than 60°) at the height of 1.45 m is called the effective face recognition region, with an FRR over 95% using the FaceNet with the RGB-D camera. (ii) A human one meter in front of the camera, without occlusion of the two eyes and with an occlusion not over 50%, has an FRR over 95%. (iii) A light orientation not over 90° in a dark environment also has an FRR over 95%. (iv) The occlusion of the upper half (i.e., two eyes and nose) only has an FRR of 17.3%. This is reasonable since the two eyes are one of the key sub-regions for face recognition. (v) The processing frequency is about 20 fps, which is 50% smaller than that of the SSD. (vi) The performance evaluation can be found at the attached URL.

C. SPECIFIC HUMAN TRACKING USING KCF
Since the SSD and FaceNet only operate on a single frame to judge whether the specific human (SH) is recognized, the corresponding result is not carried over to the next frame. Hence, once the SH is detected, the Kernelized Correlation Filter (KCF) is used to keep tracking the SH. The KCF is a tracking approach that performs the SH tracking and the training for prediction simultaneously.

1) PRELIMINARIES
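To consider the practical situation, a nonlinear mapping of the sample $x_i$ around the (blue) rectangle detected by the FaceNet is denoted as $\varphi(x_i)$. Its kernelized regression function becomes $y_i = f(x_i) = w^T \varphi(x_i)$. Define the following circulant matrix [24]:

$$X = C(x) = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ x_n & x_1 & \cdots & x_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ x_2 & x_3 & \cdots & x_1 \end{bmatrix} \tag{13}$$

The regularized least-squares weight estimation minimizes the following cost function:

$$\min_{w} \sum_{i} \left( f(x_i) - y_i \right)^2 + \lambda \|w\|^2 \tag{14}$$

where $\lambda > 0$ is the regularization parameter. Then the minimization of (14) yields

$$w = \sum_{i} \alpha_i \varphi(x_i), \qquad \alpha = (K + \lambda I)^{-1} y \tag{15}$$

where $K_{ij} = \varphi(x_i)^T \varphi(x_j) = \kappa(x_i, x_j)$ is the row $i$ and column $j$ of $K$.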

2) DIAGONALIZATION OF CIRCULANT MATRIX USING DFT
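A circulant matrix is diagonalized by the discrete Fourier transform (DFT):

$$X = F \,\mathrm{diag}(\hat{x})\, F^H \tag{16}$$

where $\hat{x}$ is the DFT of $x$, $X^H = (X^*)^T$, and $F$ is the unitary DFT matrix. For the cyclic shifts of $x$, the kernel matrix is circulant, i.e., $K = C(k^{xx})$, where $k^{xx}$ is the kernel correlation of $x$ with itself. From (15), (16), and $F F^H = I_n$, the following result is achieved.

$$\alpha = F \,\mathrm{diag}\!\left( \hat{k}^{xx} + \lambda \right)^{-1} F^H y \tag{17}$$

Or, equivalently,

$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda} \tag{18}$$

where the division is element-wise. The regression function for all candidate patches with $z$ is computed as follows:

$$f(z) = (K^z)^T \alpha, \qquad K^z = C(k^{xz}) \tag{19}$$

Here $f(z)$ is a vector containing the output for all cyclic shifts of $z$, i.e., the full detection response. From (19),

$$\hat{f}(z) = \hat{k}^{xz} \odot \hat{\alpha} \tag{20}$$

Here $\odot$ denotes the pointwise operator. It indicates that the output for each input vector is the pointwise multiplication of $\hat{k}^{xz}$ and $\hat{\alpha}$. The maximum output is the maximum likelihood of the next position of the target.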
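The Fourier-domain training and detection steps can be sketched on a one-dimensional, single-channel signal. This is a simplified illustration, not the tracker used in the experiments: the Gaussian kernel, the parameters `sigma` and `lam`, the Gaussian-shaped regression target, and all function names are assumptions; a real tracker works on 2-D multi-channel image patches.

```python
import numpy as np


def gaussian_correlation(x, z, sigma=0.5):
    """Gaussian kernel between z and every cyclic shift of x, evaluated via the FFT."""
    cross = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(z)))
    d2 = np.dot(x, x) + np.dot(z, z) - 2.0 * cross
    return np.exp(-np.maximum(d2, 0.0) / (sigma ** 2 * x.size))


def train(x, y, lam=1e-4, sigma=0.5):
    """Dual coefficients in the Fourier domain: y_hat / (k_hat_xx + lambda)."""
    k_xx = gaussian_correlation(x, x, sigma)
    return np.fft.fft(y) / (np.fft.fft(k_xx) + lam)


def detect(alpha_hat, x, z, sigma=0.5):
    """Response for all cyclic shifts of z: inverse FFT of k_hat_xz * alpha_hat."""
    k_xz = gaussian_correlation(x, z, sigma)
    return np.real(np.fft.ifft(np.fft.fft(k_xz) * alpha_hat))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 64
    x = rng.standard_normal(n)                    # template sample
    idx = np.arange(n)
    dist = np.minimum(idx, n - idx)               # circular distance to index 0
    y = np.exp(-0.5 * (dist / 2.0) ** 2)          # regression target peaked at shift 0
    alpha_hat = train(x, y)
    response = detect(alpha_hat, x, np.roll(x, 3))  # the target moved by 3 samples
    print("estimated shift:", int(np.argmax(response)))
```

The position of the response maximum gives the estimated displacement of the target, and no matrix inversion is needed because every operation is element-wise in the Fourier domain.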

3) PERFORMANCE EVALUATION
A representative experimental result is shown in Fig. 11. The green, blue, and red rectangles in Fig. 11 are the human detected by the SSD, the SH recognized by the FaceNet, and the SH tracked by the KCF, respectively. After Fig. 11(v), the FaceNet fails to recognize the SH, but the KCF still continues to track the SH. The reasons for the failure are that Fig. 11(iv), 11(v), and 11(vi) have too large a viewing angle (cf. Tables 5 and 6), and that Fig. 11(vii)-11(xii) have too large a distance between the SH and the camera (cf. Table 4). This confirms the effectiveness of the KCF.

V. IMAGE-BASED POSE TRACKING USING ADAPTIVE FINITE-TIME HIERARCHICAL CONSTRAINT CONTROL

A. IMAGE-BASED DESIRED POSE
The width of the green rectangle for the human detection using the SSD is employed to estimate the vertical and horizontal distances between the ODMR and the detected human. When the distance is smaller than 3.5 m, the pose can be accurately estimated from the depth image. Hence, only the distances of 3∼8 m between them are presented in Table 9.
Since the detected human is too near the vision system, e.g., 4m, the whole ROI for the human is infeasible. Only the green rectangle's width is used to estimate the 2D pose between ODMR and detected human. Based on the result of Table 9, the estimated vertical distance between the ODMR and the rectangle's width is achieved as follows: Moreover, the horizontal distance d h,j at distance j meter is computed by the central pixel c p at d v = 3, 4, 5, 6, 7m: The d h between d v = 3 and 7m is achieved as follows: The desired 2D pose for the AFTHCC in the next subsection becomes

B. AFTHCC
At the outset, the system variables and parameters of the ODMR are listed in Table 10 and Table 11, respectively. The values in Table 11 are obtained from the manual of the motor, the dimensions of the ODMR, and the knowledge of friction.
Based on the information of image processing, the required pose tracking for searching or tracking (specific) human is achieved by the ODMR with AFTHCC [17], [18]. At first, the kinematic relation of the ODMR is depicted in Fig. 12.
The overall control block diagram of the proposed AFTHCC is shown in Fig. 13, including the adaptive finite-time virtual desired pose (AFTVDP) and the adaptive finite-time tracking control (AFTTC).
The uncertainties of the indirect mode are given as follows:
These include the uncertainty caused by the time derivative of the sign function at zero; $\Delta Y_{d2,eq}(t)$ is the uncertainty of $Y_{d2,eq}(t)$ in (38b); $\Delta A_{123}(Z_1, t)$ is the uncertainty of $A_{123}(Z_1)$; $\Delta G(Z_1, t)$ is the uncertainty of $G(Z_1)$ in (39a); $H(Z_1)$ is given in (39b); and $0 < \beta_1 < 1$. Since $\dot{E}_{1,0}(t) \le 2/\Delta t$, where $\Delta t$ is the sampling time, is bounded, $E_1(t)^{\alpha_1}\dot{E}_{1,0}(t)$ becomes smaller as $E_1(t) \approx 0$. Hence, the upper bound of (31a) is assumed as follows:
where $0 \le \rho_{11}, \rho_{12}$ are bounded but unknown and are learned by the adaptive law (37).
Similarly, the uncertainties of the direct mode and their upper bound are described as follows:
Here $0 < \beta_2 < 1$, and $0 \le \rho_{21}, \rho_{22}$ are bounded but unknown and are learned by the adaptive law (37).
In addition, the uncertain switching gain of the AFTVDP satisfies the following inequality (cf. (38c)):
Similarly, the uncertain gain of the AFTTC satisfies the following inequality:
The first switching surface for the indirect mode is as follows:
Similarly, the second switching surface for the direct mode is as follows:
Here the second switching surface is a vector in $\Re^{3}$; $H_p, H_i, H_n > 0 \in \Re^{3\times 3}$ are constant diagonal matrices; $H_p$ is nonsingular; and $0 < \alpha_2 < 1$.
The adaptive laws for the two unknown coefficients of the upper bounds of the uncertainties in the indirect and direct modes (i.e., $\rho_{ij}$, $i, j = 1, 2$, in (31b) and (32b)) are designed as follows:
Here the adaptation term for the estimate of $\rho_{ij}$ at time $t$ combines $\lambda_{ij}$, the $i$-th switching surface with the exponent $j + (2 - j)\beta_i$, and the factor $1 - \delta_{ij}$ times the estimate; $\rho_{ij,M} = \max\{\rho_{ij}\}$, $i, j = 1, 2$, are assumed to be known; and $\lambda_{ij}, \delta_{ij} > 0$ are the learning rate and the e-modification rate, respectively.
Then the AFTVDP is designed as follows:
where $\varepsilon_1$ is a small positive constant, and the components of $G(Z_1) = [g_i(Z_1)]_{i=1,2,3}$ and $H(Z_1) = [h_{ij}(Z_1)]_{i,j=1,2,3}$ are given as follows:
To drive the switching surface, and then the tracking error, to zero in finite time, the nonlinear switching gains [38], [39] are designed as follows:
where $F_{1j} \ge I_3$, $\kappa_{1ij} > 0$, $i, j = 1, 2$.
In addition, $\mathrm{sign}(E_1)^{\alpha_1 - 1}$ is a diagonal matrix. The AFTTC for the system (25)-(26) is designed as follows:
Similarly, $\varepsilon_2$ is a small positive constant, and the diagonal nonlinear switching gains are designed as follows:
where $F_{2j} \ge I_3$, $\kappa_{2ij} > 0$, $i, j = 1, 2$. For the stability analysis, refer to [38], [39].
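The idea of a finite-time switching term combined with an adaptively learned bound can be illustrated on a scalar toy system. The sketch below is a generic example and not the AFTHCC: the plant $\dot{x} = u + d(t)$, the disturbance, the gains `k`, `alpha`, `lam`, `delta`, and the specific adaptive law (a learning term with leakage scaled by $|s|$, in the spirit of e-modification) are all assumptions chosen for illustration.

```python
import math


def simulate(x0=1.0, t_end=5.0, dt=1e-3, k=2.0, alpha=0.5, lam=5.0, delta=0.1):
    """Euler simulation of a scalar finite-time switching controller.

    Plant (assumed):      x_dot = u + d(t), with |d(t)| <= 0.5
    Switching surface:    s = x
    Control (assumed):    u = -k*|s|^alpha*sign(s) - rho_hat*sign(s)
    Adaptive law (assumed, e-modification style):
                          rho_hat_dot = lam*|s|*(1 - delta*rho_hat)
    Returns the final state and the final estimate rho_hat.
    """
    x, rho_hat, t = x0, 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        s = x
        sgn = (s > 0) - (s < 0)
        # Fractional-power term gives the finite-time behavior near s = 0;
        # the adaptive term compensates the unknown disturbance bound.
        u = -k * abs(s) ** alpha * sgn - rho_hat * sgn
        d = 0.5 * math.sin(2.0 * t)
        x += dt * (u + d)
        # Learning is driven by |s|; the leakage keeps rho_hat below 1/delta.
        rho_hat += dt * lam * abs(s) * (1.0 - delta * rho_hat)
        t += dt
    return x, rho_hat


x_final, rho_final = simulate()
print(x_final, rho_final)
```

In this toy case the leakage term bounds the estimate by `1 / delta`, so the adaptive gain cannot grow without limit even if the surface never reaches exactly zero because of the discrete-time switching.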

VI. INTERACTIONS BETWEEN SPECIFIC HUMAN AND ODMR
At the outset, the specific human is at the pose (7 m, 1 m, 180°), i.e., the human's face is turned away from the ODMR. The corresponding human-robot interactions with the control parameters in Table 12 are depicted by the important snapshots from the ODMR shown in Fig. 14. Furthermore, the control response using the proposed AFTHCC is presented in Fig. 15, including (a) trajectory tracking in the XY plane, (b) 2D pose, (c) control input, (d) switching surfaces of the indirect and direct modes, (e) tracking errors of the indirect and direct modes, and (f) the estimated coefficients for the upper bounds of the indirect and direct modes' uncertainties [18].
The responses are illustrated as follows:
(i) At the very beginning, the SSD searches for a human in the first field of view (FOV). Since no human is detected, the ODMR turns 90°.
(ii) Even though the background lighting is not uniform, the human at a distance of over 7 m is detected by the SSD. Then the 2D pose between them (i.e., (24)) is computed.
(iii) Based on the IB-AFTHCC, the ODMR is controlled to the desired orientation $\phi_d(t)$ such that the human is at the central position of the FOV.
(iv) The 2D pose $(x_d, y_d, \phi_d)$ between them is computed.
(v) The ODMR is controlled to the desired position $(x_d, y_d)$, i.e., 2.5∼3 m between them, by the IB-AFTHCC.
(vi) The 2D pose between them is computed from the depth image. The ODMR is controlled to the desired orientation $\phi_d(t)$ by the IB-AFTHCC such that the human is at the central position of the FOV.
(vii) The vertical position of the ODMR (about 1 m and 0° between them) is also controlled by the IB-AFTHCC. Since the SH is not recognized, the ODMR is controlled to another desired pose (about 1 m and 90° between them).
(viii) The FaceNet is applied to recognize the SH, who uses his right hand to occlude the recognition, because the SH is otherwise probably recognized at the 90° orientation (cf. Table 5). Since the SH is not recognized, the ODMR is controlled to another desired pose (about 1 m and 180° between them).
(ix) Since he is recognized as the SH, the KCF tracker is initiated (i.e., the red rectangle denotes the tracking of the SH).
(x) Since the KCF is tracking the SH, the FaceNet is no longer executed; hence, the blue rectangle in snapshot (ix) of Fig. 14 disappears.
(xi) The SH turns 180° to execute the task of human following; the green and red rectangles still remain unchanged.
(xii) When the distance is larger than 1.5 m, the ODMR tracks the SH until the distance is less than 1.5 m, and then stops.
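Two decision rules in the sequence above lend themselves to a compact sketch: the order of viewing angles tried for face recognition at about 1 m (0°, 90°, 180°) and the distance rule for human following (move when farther than 1.5 m, otherwise stop). The code below is a simplified illustration with hypothetical function names; it does not include the controller or the image processing, and treating a distance of exactly 1.5 m as "stop" is an assumption.

```python
# Viewing angles (degrees) tried in turn for face recognition, per the sequence.
RECOGNITION_ANGLES_DEG = (0, 90, 180)

# Distance (meters) above which the robot follows the SH.
FOLLOW_THRESHOLD_M = 1.5


def next_recognition_angle(current_deg):
    """Return the next viewing angle to try, or None if all have been tried."""
    idx = RECOGNITION_ANGLES_DEG.index(current_deg)
    if idx + 1 < len(RECOGNITION_ANGLES_DEG):
        return RECOGNITION_ANGLES_DEG[idx + 1]
    return None


def should_follow(distance_m):
    """Follow the SH only while the distance exceeds the threshold."""
    return distance_m > FOLLOW_THRESHOLD_M


# Recognition fails at 0 and 90 degrees, so the robot proceeds to 180 degrees.
angle = 0
for recognized in (False, False, True):
    if recognized:
        break
    angle = next_recognition_angle(angle)
print(angle, should_follow(2.0), should_follow(1.2))
```

Keeping these rules separate from the pose controller makes it straightforward to change the angle order or the following distance without touching the control law.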
The maximum position and orientation errors are about 4 cm and 5°, respectively, which are excellent for the motion control task. Finally, the planned human-robot interactions are successfully accomplished. The corresponding experimental video is available at https://youtu.be/FF-cf7nv5Uo.

VII. CONCLUSION
The deep learning approach using the SSD-FN-KCF is developed such that a specific human (SH) is identified and tracked to execute the required interactions. The green, blue, and red rectangles are the outputs of the SSD, FN, and KCF, respectively. While the image-based adaptive finite-time hierarchical constraint control (IB-AFTHCC) executes the planned poses, the three techniques are integrated to exploit each method's advantages and avoid its drawbacks. The SSD using the RGB-D camera with the resolution of 320 × 240 can detect humans up to 8 m, which is a favorable result as compared with other state-of-the-art methods [3]-[8]. Due to the low resolution of the RGB image for the FaceNet, the specific face (or human) is successfully recognized only up to 1.25 m. Even so, the results under larger pose variations, including the occlusion of the human and the large change of lighting orientation, still confirm the robustness of the face recognition. By avoiding repeated face recognition, the KCF not only reduces the processing time but also extends the tracking distance of the detected SH to achieve the satisfactory task of HRI. Furthermore, the advantages of the ODMR (i.e., simultaneous translation and rotation) and the IB-AFTHCC fulfill the HRIs, e.g., searching for, detecting, and tracking the (specific) human, and human following. One of our future studies is to extend the dynamic face emotion recognition result up to 3.5 m using a stereo camera and a very deep CNN method [40], [41].