Development of an Autonomous Unmanned Aerial Manipulator Based on a Real-Time Oriented-Object Detection Method

Autonomous Unmanned Aerial Manipulators (UAMs) have shown promising potential in mobile 3-dimensional grasping applications, but they still suffer from some difficulties impeding their board applications, such as target detection and indoor positioning. For the autonomous grasping mission, the UAMs need ability to recognize the objects and grasp them. Considering the efficiency and precision, we present a novel oriented-object detection method called Rotation-SqueezeDet. This method can run on embedded-platforms in near real-time. Besides, this method can give the oriented bounding box of an object in images to enable a rotation-aware grasping. Based on this method, a UAM platform was designed and built. We have given the formulation, positioning, control, and planning of the whole UAM system. All the mechanical designs are fully provided as open-source hardware for reuse by the community. Finally, the effectiveness of the proposed scheme was validated in multiple experimental trials, highlighting its applicability of autonomous aerial rotational grasping in Global Positioning System (GPS) denied environments. We believe this system can be deployed to many potential workplaces which need UAM to accomplish difficult manipulation tasks.


Introduction
During the past several decades, Unmanned Aerial Vehicles (UAVs) have shown surprising potential among numerous applications [1,2]. When mounted with different kinds of sensors, the UAVs can extend the sensing range and transform a static sensing task into a mobile sensing task. The UAVs' advantages of freely using 3-dimensional (3D) space bring possibilities to many traditional tasks, like manipulation tasks in a warehouse or a factory.
Unmanned aerial manipulators (UAMs) are known as one specific type of UAVs equipped with one or multiple robotic arms and have attracted a lot of research interests in recent years [3]. One main advantage of a UAM is that it shows promising potential in transforming passive sensing missions into mobile 3D interactive missions, like grasping [4] and assembling [5]. Aerial maneuvering and hovering capabilities make it possible for a UAM to accomplish many kinds of missions that are difficult for human workers, such as grasping plastic bottle waste on a cliff. Furthermore, there are many places, like an Amazon warehouse, with the potential to deploy a UAM system for autonomous picking and placing. To fully utilize the space in a warehouse, goods can be placed at a relatively high place, which is difficult for human workers to reach. At this time, the UAM can give a lot of help.
In future warehouses, there will be many UAMs flying simultaneously to sort the goods. Besides, places, like the Fukushima nuclear power plant in Japan, that are filled with obstacles and are hard for human workers to work inside are ideal places to deploy a UAM. Potential applications of UAMs can easily come from people's imaginations and surely not be limited to the situations mentioned above, which is the reason why we believe UAMs have a bright future.
For autonomous aerial manipulation missions, building a controllable system is always the first step. However, many factors can influence the stability of the overall system, such as the change of Center of Gravity (CoG) generated by the movements of the robotic arm, the reaction force produced by the robotic arm, and the complex aerodynamics effects. Many efforts have been made to reduce these effects [6][7][8][9][10]. A comprehensive dynamic model of hexacopter with a robotic arm has been built in [6], analyzing the effects of CoG change and mass distributions. The kinematics and dynamics models of a quadrotor with n Degree of Freedom (DoF) robotic arm have been formulated in [7]. A Variable Parameter Integral Backstepping (VPIB) controller proposed in [8] can control a UAM with better performance than a Proportional Integral Derivative (PID) controller. With a movable compensation mechanism, a multilayer architecture controller that compensates for the internal and external effects, layer by layer, to control the UAM was presented [9]. To suppress the torque generated by the movements of the robotic arm, a novel mechanism with a simplified model has been adopted in [10]. For some scenarios where multiple UAMs are required to work at the same time, some swarm studies [11,12] are also worth considering.
Usually, human operators are unable to accurately control the UAM due to the point of view change and data transmission delay. Also, for the monocular camera, the lack of depth perception in teleoperation makes it hard for the operators to decide whether the object is appropriately put into the claw of robotic arm. Such inaccuracy causes the robotic arm to be difficult to align and grasp the objects. Therefore, it is better the UAM can work without relying on external control by human operators. However, if there is no external control by human operators, the UAM needs to recognize the objects by itself, meaning we need to give the UAM perception ability. Currently only a few works [13][14][15] considered robust visual perception aided autonomous grasping. An Image-Based Visual Servo (IBVS) was implemented in [13] to help to locate the position of the targeted object. Feature models were used in [14] to find the position of known targets. In [13], correlation filters were adopted to track targets.
In recent years, end-to-end object detection algorithms like [16][17][18] have shown surprising results. Faster R-CNN [16] is an end-to-end two-stage framework for object detection. It firstly uses the Region Proposal Network (RPN) to extract proposals which contain objects, then classifies and refines these proposals. You Only Look Once (YOLO) [17] is a one-stage object detector, and it is much faster than two-stage framework. Considering efficiency, SqueezeDet [18] fully utilized SqueezeNet [19], which is a small network but achieves AlexNet [20] level accuracy with 50x fewer parameters. And SqueezeDet achieves a good tradeoff between detection accuracy and efficiency.
However, attempts to directly apply these methods in the UAM system usually end with failure or unsatisfactory robustness. The reason lies in the fact that objects, like bottles, in the UAV perspective often appeared with arbitrary poses, as shown in Figure 1a, since the camera is tight-coupled and rotated with the UAV body. Moreover, most grasping missions need a target's rotation angle for not touching the target when the end-effector is approaching. Furthermore, performance, like robustness and efficiency of the detection algorithms, plays an important role in grasping missions since it can bring more mobility to the UAM. However, current common oriented bounding boxes-based methods presented in [21,22] are designed for large-scale objects detection, like detecting a car in a large satellite image. Since the cars only make up a few pixels in a large satellite image, they need a much deeper convolution network to improve their accuracy. And this deep network cannot run in real-time performance even with an NVIDIA GTX1080 graphic card. Considering the size and weight requirement of a drone platform, the NVIDIA Jetson TX2 is the current optimal choice for mobile deep learning platforms. But methods like [21,22] are still unable to run onboard since both require more than 11G memory to load a deep convolution network, and the Jetson TX2 only has an 8G memory, let alone the real-time performance. Without a real-time performance, the grasping efficiency has been greatly reduced because the UAM needs to fly much slower to not pass the target. In order to solve the problems mentioned above, we proposed Rotation-SqueezeDet, which can regress rotation angle and position in a 2-dimensional (2D) image in near real-time. Unlike the common horizontal bounding box descriptor (c x , c y , w, h), where (c x , c y ) is the center location, and w and h are the width and height of the bounding box, respectively, Rotation-SqueezeDet introduces a new θ term, and thus uses (c x , c y , w, h, θ) to describe object position and rotation angle in the 2D image. This not only makes the detection more robust since the bounding box included fewer background, but also provides the rotation angle of the target. By using an Intel RealSense D435 depth camera, the relative 3D distance of targets can be measured in the point cloud generated by registered depth image once the target is detected. Hence, a rotation-aware grasping for autonomous grasping is possible. A glimpse of detection results is shown in Figure 1b. Now we want to deploy this detection in the UAM for autonomous grasping. However, the system mentioned in previous paragraphs have only considered neither the control problem or system compensate problem but an overall applicable indoor system. Usually, they test the system in an outdoor environment with strong Global Positioning System (GPS) signals for localization or use an expensive external visual caption system like VICON (www.vicon.com) to provide state estimation of the drone. Both of these two ways are hard to achieve in a real industrial environment. So, we adopted the lightweight Simultaneous Localization and Mapping (SLAM) system VINS-Mono [23] for indoor state estimation, giving our system the ability to fly indoors with only an inexpensive camera. Furthermore, based on the Rotation-SqueezeDet, we designed and implemented a complete indoor, workable UAM system.
The main contributions of this paper can be summarized in two folds. First, the Rotation-SqueezeDet method was proposed. This method can run on Jetson TX2 in near real-time and enable successful rotation-aware grasping. Moreover, we believe that this method will be suitable in not only the aerial grasping but also in more general missions. Second, we have designed and assembled the whole UAM platform and fully tested this system in an indoor environment. The system is affordable since it costs less than $2300 USD and can fly without relying on expensive visual motion caption system. We believe this system can be adopted to many potential workplaces, like Amazon warehouses, which need UAMs to accomplish difficult manipulation tasks. All mechanical structures are provided as open-hardware for reuse by the community (https://github.com/eleboss/UAMmech).
The rest of this paper is organized as follows. Section 2 describes the design and details of the overall system. Section 3 describes the formulation, control, and planning of the UAM system. Section 4 describes the complete vision system including Rotation-SqueezeDet. Section 5 experimentally demonstrates the system including autonomous grasping and vision system performance in an open dataset. Finally, the conclusion and future work are presented in Section 6.

Notation
The East, North, and Up (ENU) coordinate system are used as a world-fixed inertial frame corresponding to {x w , y w , z w }. Following the definition of well known Denavit-Hartenberg (D-H) parameters [24], the link frame of Link i is defined as the {x i , y i , z i }. Especially, we define {x 0 , y 0 , z 0 } as the fixed arm frame. The definition of the link frame is detailed in Figure 2 and the table in the bottom left gives the D-H parameters of the robotic arm. The body frame is assumed to be the geometrical center of the UAM denoted as indicate the roll-pitch-yaw Euler angels. {x t , y t , z t } define the detected targets in the RealSense D435 camera frame.

Hardware
In this work, a modified DJI hexacopter frame F550 (www.dji.com/flame-wheel-arf/feature) was adopted. The hardware components are shown in Figure 3.
The propulsion module of the UAM was composed of Sunnysky (www.rcsunnysky.com) x3108s motor, Hobbywing (www.hobbywing.com) platinum Electronic Speed Controller (ESC), and a 10-inch propeller. The Pixhawk4 (www.holybro.com/product/55) autopilot with PX4 (https://github.com/ PX4/Firmware) V1.8.0 flight stack and NVIDIA Jetson TX2 (developer.nvidia.com/embedded/ buy/jetson-tx2) were placed at the top of the UAM as the main computing devices. A global shutter monochrome camera (www.jinyandianzi.com) was tight-coupled with the Pixhawk using a 3D-printed anti-vibration damping plate. A Benewake TFmini (www.benewake.com/tfmini.html) laser rangefinder was chosen for being cheap, lightweight, and with up to 12 m maximum detection range, and was mounted downward facing to provide altitude feedback. An Intel RealSense D435 (www.realsense.intel.com/stereo) camera was mounted at the middle of the drone, facing forward to find the targets.

LX-15D Bus Servos
In Robotic Arm  We built a Displacement Compensation System (DCS) that can move counterweight to align the CoG and thus improves the stability of the total system. The DCS was mounted in the middle of UAM and made by 3D printed Poly Lactic Acid (PLA) material, including tow rails, a slide table, and a bus servo to provide drive force. The LEBOT (www.lobot-robot.com) LX-15D serial bus servos were chosen for being budget-friendly, lightweight, and having multiple extra structures to facilitate the installation. Another significant advantage of the bus servo is it can largely reduce the wiring complexity and control difficulty. So we could only use one serial port in the Jetson TX2 to control all servos. While the LX-15D servo can only provide 240 • feedback, we used a short-range Time of Flight (ToF) laser rangefinder GY-53 to provide the battery position feedback.
A 5200 mAh 4S-35C battery weighted 0.525 kg was used for providing enough power to the propulsion system and as a counterweight for the DCS. And this battery could sustain the flight time around 10 min. Another 1500 mAh 3S-30C weight 0.138 kg battery was used for providing power to the robotic arm and computing facilitates.
Drive forces of the robotic arm were also provided by LX-15D servos. The robotic arm had 3-DoF and could grasp objects by the end-effector. The first 2-DoF provided the robotic arm mobility to move at the planar 2D plane. The last DoF enabled a rotational grasping by cooperating with the vision system. The last DoF was of vital importance for the oriented object grasping, since the unrotated grasping could easily tip the object.
The total takeoff weight of the UAM was about 4.08 kg. Thanks to the carbon fiber material and PLA material, the robotic arm was only weighted 0.459 kg and had 43 cm extended range.

Software Architecture
The software runs on two main processors: Pixhawk4 and Jetson TX2, and all processes are running onboard. Figure 4 gives an overview of the system software architecture. Note that the Jetson TX2 is overclocked to run at the maximum clock rate and unlocked two external cores to generate more computing power.  The Robot Operating System (ROS) [25] is a pseudo-operating system which allows developers to work cooperatively by following its running mechanism. Modules related to state estimation and flying control are running on the Pixhawk4 and exchange data with the Jetson TX2 by MAVROS (https://github.com/mavlink/mavros). Visual Inertial Odometry (VIO) and object detection algorithms are running on the Jetson TX2. A state machine is adapted to set the robotic arm motion and UAM waypoint. So, the grasping position (x s w , y s w , z s w ) of the UAM can be given by: where T W B , T B 0 , T B C ∈ R 4×4 are the homogeneous transformation matrix, T B 0 transforms the fixed arm frame to body frame, T B C transforms the camera frame to the body frame, T W B transforms the body frame to the world-fixed frame, and (x g 0 , y g 0 ) is the grasping point. The standard cascaded position-velocity control of the hexacopter is well implemented inside the Pixhawk4 fly controller. More detailed information can be found at this link (https://dev.px4.io/en/flight_stack/controller_diagrams). We adopted this control strategy and well tuned its inside parameters to get a robust performance.

State Estimation
State estimation is the foundation of our system, as it provides crucial information to help other parts to achieve the best performance. In this work, we use the Pixhawk4 built-in Extended Kalman Filter (EKF) to fuse multiple sensors' feedback for state estimation.
GPS usually fails to provide global positioning feedback when in an indoor environment. Hence, in order to fly indoors without using expensive visual motion caption system, we integrated the VINS-mono VIO to provide the local position feedback, and it can provide the highest level of accuracy and robustness compared with multiple VIO [26]. The VINS-mono runs at 10hz with loop-closure and the output is rotated to ENU world-fixed frame denoted (x v w , y v w , z v w ). For the VINS-mono, we used the Inertial Measurement Unit (IMU) in the Pixhawk4 as the inertial input, and a monochrome global shutter camera ran at 640 × 400 resolution and 90 Frames Per Second (FPS) to provide clear images as visual input. To reduce drifting, the monochrome camera and Pixhawk4 are tight-coupled by the 3D printed structures, the camera intrinsic matrix and camera to IMU transformation parameters are carefully calibrated by using kalibr [27]. Due to the computation limitation and data transmission delay, the output of VINS-mono runs in Jetson TX2 has about 140ms delay compared with current IMU output. In order to synchronize these outputs, we first calculate the velocity of VINS-mono estimation (ẋ v w ,ẏ v w ,ż v w ), then apply some random movements to the UAM, so the delay time can be found by is used as external vision aid of the Pixhawk4 onboard EKF to give 100hz state estimation.
For robust flying, the main altitude feedback is not given by the VINS-mono but the TFmini rangefinder. The total delay of TFmini measurement is about 30 ms.

Robotic Arm Motion Planning
In order to find the workspace of robotic arm, the forward kinematics of a 3DoF robotic arm is given by the following equations: where L 1 , L 2 are the lengths of the first and the second links. θ 1 , θ 2 , θ 3 are the rotation angles of each joint, and (x 0 , y 0 ) is the point in arm fixed frame. Since the angle θ 3 is only related to the target rotation angle θ, the workspace of our robotic arm is equal to a planar 2DoF robotic arm model. The table presented in Figure 2 gives a clear D-H parameters definition of the robotic arm. When applying the UAM into real-world scenes, the flow generated by rotors can blow the lightweight targets away, like empty plastic bottles, and it is hard to predict whether an object can be easily blown away by the downward flow or not. Hence, a safety grasping action is better to be taken under weak flow influenced conditions. High-fidelity Computational Fluid Dynamics (CFD) simulation results of many different UAVs have been presented in [28]. From the observation of these results, the flow generated by each UAV's rotor is decreasing rapidly in the outer-wing area. Inspired by this observation, we planned the robotic arm to grasp in a weak flow area. Based on the flow measurements described in Section 5.2 and the actual mechanical movement range of robotic arm, the actual workspace is given in Figure 5 We used Inverse Kinematics (IK) to solve the desired rotation angle of each joint. Assuming the robotic arm is planning to move to (x 0 , y 0 ) point, the IK result is given by: Here, θ 2 ∈ [0, π] is the downward elbow solution, and θ 1 can be derived by: Figure 5. Workspace of robotic arm printed by using forward kinematics considered flow influence.

CoG Compensation
When the UAM is static on the ground and the robotic arm holds static at (x f 0 , y f 0 ), a symmetry placement design is utilized to make sure the G x and G y are fitted with the geometry center.
However, movements of the robotic arm can change the G x . For the dynamic G x alignment, we adapted the strategy presented in [9] called DCS, moving the battery as a counterweight since its weight can provide sufficient compensation in the relatively short moving distance.
CoG transformation of Link i and the end-effector payload from link frame to the body frame are given by: where To align the G x at geometry center, a linear slider is designed to move the battery and the position of the battery p b in the body frame can be calculated by: where m i is the mass of Link i, m b is the mass of battery. The displacement compensation plays a key role in stabilizing the UAM. Without the DCS, the change of CoG can easily make aside rotor reach the maximum thrust, leading to it being unstable. And this method worked well in our system. To guarantee the compensation performance, we chose a reasonable range of rotation speed of each joint in a robotic arm to make sure it did not exceed the maximum compensation speed of the linear slider.

Control Strategy of Robotic Arm
The total control diagram of the robotic arm and DCS are shown in Figure 6. Here,θ 1 ,θ 2 ,θ 3 ,ṗ b is the rotation speed of the servos and the θ 1cur , θ 2cur , θ 3cur , p bcur indicate the current feedback. Three PID controllers are adapted to control the robotic arm. Another PID controller uses the position of the battery p b , detected by a laser rangefinder, to control the linear slider. All PID parameters are well tuned to guarantee stable and smooth control.

Vision System
In this section, we introduce the vision system, including the light and fast oriented-object detection model called Rotation-SqueezeDet and the target localization framework based on Rotation-SqueezeDet detection results, and point clouds from the depth camera.
Our proposed model is inspired by SqueezeDet [18] and Rotation Region Proposal Networks (RRPN) [21]. As the former, SqueezeDet is a single-pass detection pipeline combining bounding box localization and classification by a single network. It appears to be the smallest object detector by virtue of a powerful but small backend network of SqueezeNet [19,29]. As the latter, RRPN is based on Faster R-CNN [16], but RRPN can detect oriented objects. The difference between Faster R-CNN and RRPN is described as below. In Faster R-CNN, the Region of Interests (RoIs) are generated by RPN, and the RoIs are rectangles which can be written as R = (x min , y min , x max , y max ) = (c x , c y , w, h). These RoIs have regressed from k anchors which are generated by some predefined scales and aspect ratios. Then these RoIs will be feed into RoI Pooling layer and some fully connected layers to obtain horizontal bounding boxes. However, in RRPN, instead, it uses Rotation anchors (R-anchors) and rotation RoI Pooling, which brings the ability to predict oriented bounding boxes denoted as R = (c x , c y , w, h, θ).
Object detection algorithms like Faster R-CNN have high accuracy but slow processing speed and large storage requirement. Moreover, RRPN is slower than Faster R-CNN, it takes twice as much time as the Faster-RCNN [21]. Thus, RRPN does not meet our run-time requirement. Considering a comparable accuracy and run-time on Jetson TX2, SqueezeDet is a suitable choice, it can run about 45FPS with 424 × 240 pixels image on Jetson TX2, and is easy to train. However, SqueezeDet cannot predict the θ of the rotated object because it uses horizontal bounding boxes. So, we designed a model which can generate oriented bounding boxes based on SqueezeDet and named Rotation-SqueezeDet, and it can run on Jetson TX2 in near real time.

Network Architecture
The overall model of Rotation-SqueezeDet is illustrated in Figure 7. In this model, a convolutional neural network, SqueezeNet V1.1, first takes an image as input and extracts a low-resolution, high dimensional feature map from the image. Then the feature map is fed into the F w × F h convolutional layers to compute oriented bounding boxes at each position of conv feature map. Next, each oriented bounding box is associated with (5 + C + 1) values, where 5 is the number of bounding box parameters, C is the number of classes, and 1 is the confidence score. And each position on the convolution feature map computes K × (5 + 1 + C) values that encode the bounding box predictions. Here, K is the number of R-anchors, each R-anchor can be described by 5 parameters as (c a x , c a y , w a , h a , θ a ), (c a x , c a y ) are R-anchor's center on image, w a , h a , θ a are the width, height and angle of R-anchor, respectively.
where (c x , c y , w, h, θ) are parameters describe the predicted oriented bounding box, (c a x , c a y , w a , h a , θ a ) are parameters describe the R-anchor, and here k ∈ Z to ensure θ ∈ [0, π).

For Each Grid
Input Image

Conv Layers
Extract Features

Oriented IoU Computation
To efficiently predict the rotation angle of objects, we introduce a parallel Intersection over Union (IoU) computing method. First, we attempt to calculate the IoU using the OpenCV's functions rotatedRectangleIntersection and contourArea directly. However, the efficiency of these functions remains poor because they cannot compute parallel. Thus, we use a simple and efficient method to approximately compute the IoU in parallel, which is to use the angle deviation of two oriented bounding boxes. The approximate IoU [30] can be computed by: where θ b 1 and θ b 2 are the rotation angle of two oriented bounding boxes, and IoU is computed by treating oriented bounding boxes as horizontal bounding boxes.

R-Anchors' Selection
The R-Anchors are different from the horizontal anchors. And for the R-Anchors' selection, we use a K-means based method described in [31] to select R-anchors' w and h to match the data distribution, we set k as 9 in K-means and treat objects' angle distribution as a uniform distribution, i.e., we set R-anchors' angle as {0, π 9 , 2π 9 , 3π 9 , 4π 9 , 5π 9 , 6π 9 , 7π 9 , 8π 9 }. Therefore, there are 81 anchors at each conv feature map position.

Object Localization
To acquire the real world position of the target, we use the RGB-D camera to detect and locate the target. First, the aligned color image and point clouds can be obtained by aligning RGB image and depth image in the same coordinate system. Next, the subarea of the total point clouds containing location information of the target which can be extracted from the whole point clouds by utilizing the detection result (c x , c y , w, h, θ). After that, we use a small central subarea of target's point clouds to calculate its real-world position (x t , y t , z t ) in the camera frame given by: where (X p , Y p , Z p ) are the coordinates of the point clouds set of the center subarea of target's bounding box, its superscript indicates the i th point value started from top left corner. L, M, N are the valid points number of X p , Y p , Z p , respectively. The central subarea of target's point clouds can be written as (c x , c y , k, k), the k × k is the size of the selected central subarea of target's point clouds, which can be calculated by: Here, we set k as 5 to reduce the computing burden. Finally, the parameters of position and rotation angle of a target can be written as: (x t , y t , z t , θ), θ is the rotation angle of target's anchor.

Experiments
The experimental setup is mentioned in Section 2. During the flight tests, all data are logged onboard with no external data transmission.

Vision System Results
We trained and evaluated our vision system based on our pervious work UAV-Bottle Detection (BD) [32], a bottle image dataset under UAV perspective. It contains about 34,791 object instances in 25,407 images labeled by oriented bounding boxes. For training and evaluating our model, 64% of the images were randomly selected as the training data, 16% as validation data, and the rest 20% as the testing data.
All object detection experiments were implemented on TensorFlow [33]. We used the pertained model, SqueezeNet v1.1, to initialize the network. And the system was trained 100k steps with a batch size of 20 and a learning rate of 0.01. Besides, weight decay and momentum were 0.0001 and 0.9, respectively. The optimizer was MomentumOptimizer.
As shown in Figure 8, the Average Precision (AP) of Rotation-SqueezeDet on UAV-BD is about 78.0% when IoU = 0.5. In Figure 9, we visualize the detection results in UAV-BD dataset. Our algorithms can be given two results; first is the common horizontal bounding box, marked by red color, without any rotation, and second is the oriented blue bounding box inside the red bounding box. We can clearly see that the blue boxes are more accurate than the red boxes since it contains fewer background pixels and locates the object more precisely.  To analyze the performance of time consumption, we deployed Rotation-SqueezeDet on three different platforms and recorded the average processing time with 424 × 240 pixels color image. The detailed information of the deployed platform is shown in Table 1, and the results are shown in Table 2. From Table 2, we can see that our algorithm has achieved near real-time performance in the Jetson TX2 and real-time performance on both the Laptop and Server platform. Notice that our algorithm could run up to 111 FPS with a Titan Xp GPU, and it can save a lot of time during the training of Rotation-SqueezeDet.  The time consumption of Rotation-SqueezeDet is mainly coming from two parts. The first part is the processing of detection model based on SqueezeNet. The second part is the processing of Non-maximum Suppression (NMS). The processing speed of detection model is related to GPU, and the processing speed of NMS is related to CPU. So, using a more powerful GPU can speed up the detection model processing, and a more powerful CPU will process the MNS much faster.

Flow Distribution Validation
In order to give a rough parameter estimation for the planning mentioned in Section 3, we disarmed the UAM and keep it flying at a certain height about 1.5 m above the ground. Then we used a Smart AS856 (en.smartsensor.cn/products_detail/productId=248.html) anemometer to measure the downward flow distribution. Since the flow is highly complex, we cannot give a very accurate flow distribution by just using an anemometer; thus, we only measured a 1D flow distribution to give a rough estimation for the robotic arm planning. The moving path of the anemometer is given in the top right corner of Figure 10 and the vertical distance from the anemometer to the rotor is about 16cm to represent the overall rough distribution below the UAM. Following this path, we recorded the average value of 5 measurements at one point and drew the curve in Figure 10. The horizontal axis is the distance from the central position of the UAM body to the outside following this path. From the observation of Figure 10, the flow speed is decreasing rapidly at about 8 cm and 30 cm, and reaching the top speed at about 21 cm which is the underside of rotors. The 8 cm is close to the inside of the UAM, and the 30 cm is close to the outside of the UAM. So, the flow speed is relatively weak when at the central position and outside position of the UAM.

Aerial Robotic Arm Moving Test
To evaluate the stability and controllability of our system, the UAV was programmed to hovering at a certain point. In Figure 11, the control errors of the system during this flight were logged and presented. After finishing the takeoff procedure, the robotic arm will automatically move to multiple points and rotate the end-effector 90 • . Figure 11. Control errors history during the whole aerial robotic arm moving.
The DCS was activated in this flight, moving the battery to compensate for the change of CoG generated by movements of the robotic arm. The standard deviation of control error in the world-fixed frame is about 3.64 cm in x axis, 2.37 cm in y axis and 1.16 cm in in z axis, which indicated our system can keep hovering at the certain point whether the robotic arm is moving or not. We noticed that during the whole test, the error increased at several points including 5, 10 and 22 s, respectively. This was caused by the moving of the robotic arm, but our system could always stabilize itself. Figure 12 shows some snapshots of the robotic arm aerial moving experiments.

Autonomous Grasping
Autonomous grasping experiments with objects placed vertically and obliquely have also been conducted. Figure 13 presents the key steps of these two grasping experiments. The subfigures were taken at the specific moments ordered by the number in the bottom left corner: x indicated the UAM is searching specific target. Here, we used the plastic bottle as the target; y indicated the UAM has detected the target and given its relative position. The UAM will then align its position to the grasping position; z indicated the UAM hovering at the grasping position and the end effector is about to grasp the bottle. { indicated the UAM dropping the bottle at the dropping point. The odometry given by the onboard EKF in 100 hz rates were simultaneously presented in the top left corner, showing our system can run indoors without using any external visual motion caption system. The detection results and the calculated relative distance of the target objects were also simultaneously presented in the top right corner during searching. In Figure 13a, the target was vertically placed at the top of a cube. By comparing images captured at x and y, it is easily to know the bottle is not shown in x but detected in y. The UAM was using a constant speeded to search in x, and speeded up to get close to target after y. The end-effector successfully approached and grabbed the target in z verified our methods worked well, and the plastic bottle was not flipped by the downward flow. In {, the UAM can automatically drop the target at the dropping point, which indicates our system is capable of automatically finishing the whole grasping task.
In Figure 13b, the procedures were basically the same as mentioned above. What is different is that we placed the empty bottle with about 45 • rotation. y gave the detected target with its relative position and rotated angle. In z, the end-effector used the rotation information of the target to successfully grasp the target.
These experimental results have shown that our system can accomplish an autonomous grasping mission.

Conclusions
In this paper, we developed an approach to enable a UAM to grasp oriented objects. The key challenges included detecting and locating objects in UAM perspective, CoG compensation, the flow influence to the objects, and indoor positioning of UAM system, all of which made the autonomous grasping mission very difficult. We showed the effectiveness of our approach in real tests with the ability to detect, locate, and grasp objects with arbitrary poses in GPS-denied environments without relying on the visual motion caption system. We believe that the proposed solution, both in terms of hardware and algorithms, will be useful in not only the aerial grasping missions but also in general grasping missions since the vision system can be applied to any kind of robotic arm. Future work will be set out to investigate how to estimate the 6D pose of objects in real-time. We will also improve the stability by using the servos with torque feedback and exploring a more advanced controller suit for the UAM.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: