A Fast and Accurate Object Detection Algorithm on Humanoid Marathon Robot

ABSTRACT


INTRODUCTION
Object detection is the most common problem in robotic vision. Given an image, the class and position of the object must be predicted. State-of-the-art Convolutional Neural Network (CNN) based object detection successfully addresses the problems in this domain, even though it requires a high-performance computing platform. Santos et al. compared the performance of three CNN algorithms for detecting tree species using an RGB camera on an Unmanned Aerial Vehicle (UAV) [1]. Faster R-CNN [2], YOLOv3 [3], and RetinaNet [4] were evaluated using a folding approach and reached over 82.48% validation accuracy. However, the detection process was offline, used an off-board computing platform, and was not performed in real-time.
In typical humanoid robotics applications, such as soccer playing, marathon running, and obstacle avoidance, both real-time performance and accuracy are crucial. In such cases, CNN algorithms prove challenging, as robots require onboard computers capable of running them. Researchers address this by proposing embedded GPU modules or deep learning hardware accelerators. An embedded NVIDIA Jetson TX1 GPU on a humanoid robot, running YOLOv2 for ball and goal position detection in robot soccer, is introduced in [5]. With this approach, the detection process reaches 20 FPS and a training accuracy of over 60%. Further, [6] proposes two hardware options for deep learning object detection on a flying robot. With these methods, the detection speed reaches 8 FPS using a Jetson TX1 with a Single Shot Detector (SSD) model and 5 FPS using the SSD MobileNet model implemented on a Raspberry Pi with an additional Intel Neural Compute Stick (NCS) accelerator. Other researchers address the problem through efficient CNN algorithms, reducing the architecture and proposing an optimized region proposal layer. A region proposal layer that consists of efficient online convolutions and effective offline optimization, followed by a detection layer using an MPGA-based CNN module and a TLD-based multi-frame fusion procedure, is proposed in [7]. The proposed approach results in robust and efficient detection without relying on GPU computation. A simplified CNN architecture, called FLODNet, which consists of seven convolution and max-pooling layers followed by three fully connected layers, is described in [8]. The architecture requires only 95 MB of parameters and runs on a laptop CPU at a detection speed of around 8.85 FPS.
Our research addresses a problem in the humanoid robot marathon competition that requires a robot to detect three different markers, each a different symbol in the same dominant color. A marker must be identified using the robot camera in real-time, because it is used as feedback for the robot behavior controller. In this research, we propose a two-stage detection approach. In the first detection stage, a new region proposal method, based on a color segmentation technique, is proposed to extract a region of the object. The second stage classifies the region using a shallow CNN classifier that consists of 10 layers with 83 MB of parameters. The proposed approach is fast and accurate, and does not require a GPU or hardware accelerator for inference. The GPU is only used for training the CNN classifier.
This paper is organized as follows. Section 2 explains the method for conducting this research, starting with an introduction of the research environment, followed by the dataset collection process, then an explanation of the novel region proposal algorithm, the proposed CNN architecture, and lastly, the method used to train and validate the classifier. Section 3 explains the results of our research in detail, covering the region proposal, training and validation, inference on the robot computer, and comparisons with previous work. Section 4 outlines the main conclusions and an avenue for further research.

RESEARCH METHOD
This research was conducted beginning with dataset collection, followed by designing a region proposal algorithm and developing a CNN classifier architecture. Performance evaluation was then applied to measure the classification accuracy and detection speed of the proposed method.

Research Environment
The Federation of International Robot-soccer Association (FIRA) humanoid marathon competition was selected as the research benchmark for our object detection algorithm. In the marathon competition, a humanoid robot must be able to recognize a line and a series of markers. A marker is a 10 cm x 10 cm sign the robot uses to navigate and follow the right track, with three different directions (forward, left, and right). The robot must be able to recognize these markers and localize their position in the camera frame. Figure 1 illustrates an example of a marathon competition.
We used a modified version of the Darwin OP robot, which has 22 degrees of freedom [9]. The modification added two grippers to the end effectors of the robot arms, changed the default camera to a high-resolution web camera (Logitech C920), and replaced the processing unit with a minicomputer with an Intel Core i3 processor. The robot architecture and specifications are shown in Figure 2 and Table 1.

Dataset Collection
The image dataset was collected from two different sources and contains a total of 3,486 images. The first dataset was collected from the marathon field in our lab (Educational Robotics Centre, National Taiwan Normal University) using the robot camera and contains 660 images. The second dataset, containing 2,826 images, was collected from the marathon track of the Taiwan Humanoid 2019 robot competition and captured using a phone camera across different distances and perspectives. The images were manually cropped with picture editor software to contain only the markers. These images were used as training data for our classifier model.
In order to increase the variation of the dataset, we created synthetic image data from the original images using an image augmentation method [10], applying image transformations, diversifying brightness, changing contrast values, and placing random black rectangles on the images. Figure 3 shows examples of the original, cropped, and synthetic images.

Color-Based Region Proposal
Region proposal is commonly used to propose regions containing potential objects in two-stage object detection algorithms. Prior work in [11] proposed a selective search algorithm for generating possible object locations, and it has been used with the state-of-the-art R-CNN algorithm [12]. However, selective search creates a bottleneck, extracting 2,000 candidate regions and classifying each of them. In this work, we propose a simple region proposal algorithm based on color segmentation. We assume that each image frame contains a single object with a dominant color that can be identified by color classification. For marker detection, the marker has a dominant color, black or white. The black color is chosen as the reference color for proposing a region of interest (ROI) of the object. The pipeline for color-based region proposal is shown in Figure 4. A raw image is captured from the robot camera in red, green, blue (RGB) color space and converted to hue, saturation, and value (HSV) color space. The HSV conversion starts by normalizing the R, G, and B color channels as follows:

R' = R / 255 (1)
G' = G / 255 (2)
B' = B / 255 (3)

where R, G, and B are the red, green, and blue color intensities, respectively, in RGB color space in the range 0-255, and R', G', and B' are the normalized red, green, and blue color intensities, respectively, in the range 0-1. From the normalized RGB color space, the minimum value, the maximum value, and their range (Δ) can be calculated as follows:

V_min = min(R', G', B') (4)
V_max = max(R', G', B') (5)
Δ = V_max − V_min (6)

where V_min and V_max represent the minimum and maximum value of the normalized RGB color space. Hue (H), saturation (S), and value (V) of HSV color space are defined in (7), (8), and (9):

H = 60° × ((G' − B') / Δ mod 6), if V_max = R';
H = 60° × ((B' − R') / Δ + 2), if V_max = G';
H = 60° × ((R' − G') / Δ + 4), if V_max = B';
H = 0, if Δ = 0 (7)

S = Δ / V_max if V_max ≠ 0, otherwise S = 0 (8)

V = V_max (9)
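The conversion in (1)-(9) can be sketched directly in Python. This is a minimal per-pixel reference implementation of the equations, not necessarily the routine used on the robot:

```python
def rgb_to_hsv(r, g, b):
    """Convert 8-bit RGB to HSV following equations (1)-(9).

    Returns H in degrees [0, 360), S and V in [0, 1].
    """
    # (1)-(3): normalize each channel to the range 0-1
    rp, gp, bp = r / 255.0, g / 255.0, b / 255.0
    # (4)-(6): channel extrema and their range
    v_max = max(rp, gp, bp)
    v_min = min(rp, gp, bp)
    delta = v_max - v_min
    # (7): hue depends on which channel is the maximum
    if delta == 0:
        h = 0.0
    elif v_max == rp:
        h = 60.0 * (((gp - bp) / delta) % 6)
    elif v_max == gp:
        h = 60.0 * (((bp - rp) / delta) + 2)
    else:
        h = 60.0 * (((rp - gp) / delta) + 4)
    # (8): saturation is the range relative to the maximum channel
    s = 0.0 if v_max == 0 else delta / v_max
    # (9): value is simply the maximum channel
    return h, s, v_max
```

Note that if the robot software uses OpenCV, 8-bit HSV images store hue as H/2 (range 0-179) and scale S and V to 0-255, which would explain the magnitude of the threshold values reported in Section 3.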
In order to apply color classification, the HSV color space image is converted into a binary image by applying the color thresholding function in (10):

binary(x, y) = 255 if H_min ≤ H(x, y) ≤ H_max, S_min ≤ S(x, y) ≤ S_max, and V_min ≤ V(x, y) ≤ V_max; otherwise binary(x, y) = 0 (10)

where H_min, S_min, and V_min are the minimum threshold parameters and H_max, S_max, and V_max the maximum threshold parameters for the H, S, and V values at pixel coordinates x and y. The color thresholding process results in either 0 or 255, where 0 indicates a black pixel and 255 a white pixel. A group of white pixels on the binary image (a contour) is selected based on its area and its width-to-height ratio to obtain an ROI of the potential object. The area and width-to-height ratio parameters are predefined and adjusted manually. The contour that matches the predefined parameters is cropped from the raw RGB image, resized to 100 x 100 pixels, then converted into grayscale to reduce the number of color channels. The gray color intensity Y is defined as follows:

Y = 0.299R + 0.587G + 0.114B (11)

This grayscale image is the output of the region proposal algorithm and becomes the input for the CNN classifier to predict the class of the object.

Convolutional Neural Network Classifier
A CNN classifier is used to predict the class of the region proposed by the color-based region proposer in Section 2.3. Generally, a CNN classifier consists of convolution layers, subsampling/pooling layers, and dense/fully connected layers. Our CNN architecture is adopted from the classical LeNet-5 architecture [13], with an extra dropout layer added before the dense output layer. The CNN architecture is detailed in Figure 5.
Here, l is the layer number in the CNN architecture, a^[l−1] the previous input layer, W^[l] the convolution filter or weight, and b^[l] the bias. The output of the convolution layer a^[l] can be formulated as:

z^[l] = W^[l] * a^[l−1] + b^[l] (12)
g(z^[l]) = max(0, z^[l]) (13)
a^[l] = g(z^[l]) (14)

where z^[l] is the result of convolving the previous layer with the weight and adding the bias parameter. z^[l] is mapped onto the next layer a^[l] by applying the non-linear ReLU activation function g(z^[l]). The dimension of the convolution layer output is represented as an (n_h^[l] × n_w^[l] × n_c^[l]) tensor, where n_c^[l] is the number of convolution filters, and n_h^[l] and n_w^[l] are the height and width of the output tensor, defined as:

n_h^[l] = ⌊(n_h^[l−1] + 2p^[l] − f^[l]) / s^[l]⌋ + 1 (15)
n_w^[l] = ⌊(n_w^[l−1] + 2p^[l] − f^[l]) / s^[l]⌋ + 1 (16)

where f^[l] is the filter size, p^[l] the padding size, and s^[l] the stride of the convolution filter. The convolution layer is followed by a max-pooling layer that selects the maximum value from the previous layer within a specific window size and stride. The dimension of the max-pooling output is represented as an (n_h^[l] × n_w^[l] × n_c^[l]) tensor, where n_c^[l] is the same as n_c^[l−1], and n_h^[l] and n_w^[l] are:

n_h^[l] = ⌊(n_h^[l−1] − f^[l]) / s^[l]⌋ + 1 (17)
n_w^[l] = ⌊(n_w^[l−1] − f^[l]) / s^[l]⌋ + 1 (18)

The output tensor of the last max-pooling layer is flattened into a 1D vector to form the input neurons of the dense layer. We added a dropout layer before the output layer to reduce overfitting in the training process [14]. The dropout layer randomly discards neurons with probability given by the rate (r). In the output layer, a softmax activation function is used to predict the class of the input image, defined as follows:

ŷ_j = e^(z_j) / Σ_{k=1}^{c} e^(z_k) (19)

where e is the exponential number and c the number of classes. The predicted output of the CNN classifier (ŷ) is taken from the output of the softmax activation function. The loss function between the predicted output and the ground-truth label is calculated using the categorical cross-entropy loss in (20).
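Equations (15)-(18) can be checked with a small helper. The 100 x 100 input size comes from Section 2.3; the filter, padding, and stride values below are only illustrative, since the paper's exact layer hyperparameters are given in Figure 5:

```python
def conv_out(n, f, p, s):
    # Equation (15)/(16): convolution output height or width
    return (n + 2 * p - f) // s + 1

def pool_out(n, f, s):
    # Equation (17)/(18): max-pooling output height or width (no padding)
    return (n - f) // s + 1

# Illustrative pass over the 100 x 100 grayscale input from Section 2.3:
# a 5 x 5 convolution with no padding and stride 1, then 2 x 2 pooling.
n = conv_out(100, f=5, p=0, s=1)   # 96
n = pool_out(n, f=2, s=2)          # 48
```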

L(y, ŷ) = − Σ_{j=1}^{c} y_j log(ŷ_j) (20)
The cost function J in (21) is used to measure the performance of the CNN classifier, given the weight parameters (W), the bias parameters (b), and the number of training examples (m):

J(W, b) = (1/m) Σ_{i=1}^{m} L(y^(i), ŷ^(i)) (21)

For better performance, the W and b parameters are updated over a number of iterations using a stochastic optimization algorithm to find the minimum value of J. This process is usually called the learning phase or the backpropagation step. In this research, Adam [15] is used as the optimization algorithm. Adam has a learning rate parameter (α), a decay rate for the first moment estimates (β1), a decay rate for the second moment estimates (β2), and a small number (ε) to prevent division-by-zero errors. The learning parameters are adjusted manually for a faster training process and better performance of the CNN classifier.
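Equations (19)-(21) can be sketched together in a few lines; this is a plain reference implementation of the math, not the framework code used for training:

```python
import math

def softmax(z):
    # Equation (19): normalized exponentials over the c classes
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def cross_entropy(y, y_hat):
    # Equation (20): categorical cross-entropy for one example
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat))

def cost(labels, preds):
    # Equation (21): average loss over the m training examples
    m = len(labels)
    return sum(cross_entropy(y, p) for y, p in zip(labels, preds)) / m
```

For example, a uniform softmax output over the three marker classes gives a per-example loss of log 3 ≈ 1.099 against any one-hot label.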

Training and Validation
K-fold cross-validation [16] is used to evaluate the performance of the CNN classifier in the training stage. First, the dataset is split into 90% training data and 10% testing data. In the training stage, the training data is split into five folds (Figure 6). We used five classifiers with the same architecture; each classifier uses four folds for training and one fold for validation. We also varied the learning rate parameter to expedite the training process and improve performance. The performance of the classifier was measured by averaging the training accuracy and validation accuracy across those classifiers. In the testing stage, the final performance of the classifier was evaluated by picking the ideal classifier and testing it with a dataset that was never used in the learning phase. The training stage used a computer with a GPU, whose detailed specifications are listed in Table 2. In the inference stage, the ideal classifier was run on the robot computer with input images from the robot camera to evaluate its processing time.
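The 90/10 hold-out split followed by the five-fold rotation described above can be sketched as follows. The shuffling and fold-assignment details are our assumptions, since the paper does not specify them:

```python
import random

def train_test_split(items, test_frac=0.1, seed=0):
    # Shuffle, then hold out the final fraction as the test set
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def kfold(train_items, k=5):
    # Yield (train, validation) pairs: each fold is validation exactly once
    folds = [train_items[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, val
```

Each of the five classifiers is then fitted on one (train, validation) pair, and the reported accuracy is the average over the five runs.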

Region Proposal Results
Figure 7 illustrates the result of the color-based region proposal algorithm with threshold parameters 26, 29, 55, 121, 255, and 121 for the minimum and maximum H, S, and V values, and contour area limits of 4,651 and 37,500 pixels. Figure 7(a) shows the color thresholding process that results in a binary image. Figure 7(b) is a proposed region successfully cropped from the raw image, containing a potential object. The region proposal parameters depend on environmental lighting conditions and must be adjusted manually to obtain a proper region. Some slight noise remains from the color thresholding process (Figure 7(a)) and can be removed by adjusting the threshold parameters.
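The thresholding step of equation (10) that produces the binary image in Figure 7(a) can be sketched with NumPy. The channel ranges below are loosely based on the values reported above, but their exact mapping to individual channels is an assumption, since the subscripts were lost in typesetting:

```python
import numpy as np

def hsv_threshold(hsv, lo, hi):
    """Equation (10): pixels inside [lo, hi] on all three channels -> 255."""
    mask = np.all((hsv >= lo) & (hsv <= hi), axis=-1)
    return np.where(mask, 255, 0).astype(np.uint8)

# Tiny 1 x 2 HSV image: first pixel inside the range, second outside.
img = np.array([[[30, 100, 100], [200, 100, 100]]], dtype=np.uint8)
binary = hsv_threshold(img, lo=(26, 29, 55), hi=(121, 255, 121))
```

In an OpenCV pipeline the same operation is typically done with `cv2.inRange`, followed by `cv2.findContours` to extract the candidate regions filtered by area and aspect ratio.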

Training and Evaluation Results
Figure 8 illustrates the cost function (21) during the training phase using the Adam optimization algorithm with parameters β1 = 0.9, β2 = 0.999, ε = 0.1, and a varying learning rate (α). Figure 8(a) demonstrates that the training loss drops significantly after several epochs; α = 0.001 is the proper learning rate and reduces the loss to less than 0.2 after two epochs. On the other hand, a small α requires more epochs to converge, as shown in Figure 8(c), where α = 0.00001: the algorithm takes more epochs to reduce the loss function compared to Figures 8(a) and 8(b). In this research, each training epoch takes 6 seconds using a batch size of 500 images. By choosing the proper parameters, as in Figure 8(a), the learning phase takes 12 seconds to generate a proper classifier model on the training computer described in Table 2.
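A single Adam update with the parameters reported above can be sketched for one scalar weight; the gradient value is chosen only for illustration:

```python
import math

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=0.1):
    # First- and second-moment running averages of the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the early iterations (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update; eps prevents division by zero
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

Repeating this step over all weights for each mini-batch constitutes the learning phase described in Section 2.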
Table 3 shows an evaluation of the accuracy of the CNN classifier during the training process. Overall, the training accuracy rises above 98.652% and the validation accuracy above 99.550%. Using α = 0.001, the accuracy increases the most, to 99.929% in training and 99.924% in validation. This result suggests that a learning rate of α = 0.001 is the ideal parameter in the training phase, resulting in a faster training process and a more accurate classifier.
The confusion matrix of the classification results is illustrated in Figure 9, where the ideal classifier with α = 0.001 is used to predict the test dataset. Overall, accuracy on the test set is 99.821%, with ten misclassified results: one forward marker predicted as a right marker, four right markers predicted as left markers, and five left markers predicted as right markers. This result shows that the classifier generalizes well, as both the validation and test accuracy are above 99%.
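The reported test accuracy is consistent with a test set of about 5,578 images, i.e. roughly 10% of the 55,776-image dataset; this reconstruction is ours, as the paper does not state the test-set size explicitly:

```python
# Misclassifications reported for the test set:
# 1 forward->right, 4 right->left, 5 left->right
errors = 1 + 4 + 5
test_size = 5578  # assumed: roughly 10% of the 55,776 images
accuracy = 100.0 * (test_size - errors) / test_size
# round(accuracy, 3) matches the reported 99.821%
```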

Inference Results
The inference results of our object detection approach were evaluated on the robot computing hardware (Figure 10) for each marker: left, right, and forward. The marker object is identified and delimited by a green bounding box, while text at the top-left corner of the image shows the predicted label from the CNN classifier. Based on the experimental results, the algorithm is capable of recognizing a marker object against different backgrounds. To evaluate the processing speed, we used a statistical approach with ten sampled measurements to determine the average processing time of our algorithm. On average, it takes 24.313 ms (41.13 FPS) to process one image frame, excluding the process of acquiring the image from the camera device. Table 4 shows a comparison benchmark of our approach against prior work. Based on the comparison, our approach is the fastest algorithm, reaching 41.13 FPS compared to previous approaches, even while running on an Intel Core i3 CPU. Moreover, the validation accuracy is the highest compared to the other approaches.
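The averaging procedure described above can be sketched as a generic timing loop; the `process_frame` callable stands in for the full detection pipeline and is a placeholder:

```python
import time

def mean_fps(process_frame, frames, samples=10):
    # Average wall-clock processing time over a fixed number of samples
    times = []
    for frame in frames[:samples]:
        start = time.perf_counter()
        process_frame(frame)
        times.append(time.perf_counter() - start)
    mean_ms = 1000.0 * sum(times) / len(times)
    return mean_ms, 1000.0 / mean_ms
```

With the reported mean of 24.313 ms per frame, 1000 / 24.313 gives the quoted 41.13 FPS.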

CONCLUSION
In this paper, a two-stage CNN-based object detection algorithm is introduced to solve an object detection problem within a typical humanoid marathon robot competition. The detection algorithm consists of a color-based region proposal and a CNN classifier with six convolution and max-pooling layers, followed by four dense layers. An Adam optimizer is used to optimize the classifier model on a dataset that was collected and augmented. In the experimental results, the proposed algorithm was able to detect three categories of markers with a training accuracy of 99.929%, a validation accuracy of 99.924%, and a test accuracy of 99.821%. The algorithm runs on an onboard robot computer with an Intel i3-5010U CPU @ 2.10GHz at a maximum detection speed of 41.13 FPS. However, setting up the color segmentation parameters requires further consideration and is an area for future work.

A Fast and Accurate Object Detection Algorithm on Humanoid Marathon Robot (ER Jamzuri et al), IJEEI ISSN: 2089-3272

Figure 1. Humanoid robot in the FIRA marathon competition.
Figure 2. (a) Mechanical design of the robot; (b) real view of the robot.
Figure 3(a) shows an example of an image from the robot camera, Figure 3(b) a cropped image, and Figure 3(c) a synthetic image from the data augmentation. Overall, the dataset contains 55,776 images from the original datasets and the image augmentation results. The dataset distribution of each class is 17,696 images (31.727%) of forward markers, 18,000 images (32.272%) of right markers, and 20,080 images (36.001%) of left markers.
Figure 3. (a) Raw image (b) cropped images (c) synthetic images.


Figure 6. Dataset for the training and testing process.
Figure 7. (a) Binary thresholding result; (b) a proposed region of the potential object.

Figure 8. Loss value during the training phase.

Table 2. Computer specification for training the CNN.

Table 3. Training and validation accuracy.
Figure 9. Confusion matrix of classification on the test dataset.

Table 4. Comparison benchmark with prior work.