AirCapRL: Autonomous Aerial Human Motion Capture Using Deep Reinforcement Learning

In this letter, we introduce a deep reinforcement learning (DRL) based multi-robot formation controller for the task of autonomous aerial human motion capture (MoCap). We focus on vision-based MoCap, where the objective is to estimate the trajectory of the body pose and shape of a single moving person using multiple micro aerial vehicles. State-of-the-art solutions to this problem are based on classical control methods, which depend on hand-crafted system and observation models. Such models are difficult to derive and to generalize across different systems. Moreover, the non-linearities and non-convexities of these models lead to sub-optimal controls. In our work, we formulate this problem as a sequential decision making task to achieve the vision-based motion capture objectives, and solve it using a deep neural network-based RL method. We leverage proximal policy optimization (PPO) to train a stochastic decentralized control policy for formation control. The neural network is trained in a parallelized setup in synthetic environments. We performed extensive simulation experiments to validate our approach. Finally, real-robot experiments demonstrate that our policies generalize to real world conditions.


I. INTRODUCTION
Human motion capture (MoCap) implies accurately estimating 3D pose and shape trajectory of a person. 3D pose, in our case, consists of the 3D positions of the major human body joints. Shape is usually parameterized by a large number (in thousands) of 3D vertices. In a laboratory setting MoCap is performed using a large number of precisely calibrated and high-resolution static cameras. To perform human MoCap in an outdoor setting or in an unstructured indoor environment, the use of multiple and autonomous micro aerial vehicles (MAVs) has recently gained attention [1], [2], [3], [4], [5]. Aerial MoCap of humans/animals facilitates several important applications, e.g., search and rescue using aerial vehicles, behavior estimation for endangered animal species, aerial cinematography and sports analysis.
Realizing an aerial MoCap system involves several challenges. The system's robotic front-end [2] must ensure that the subject i) is accurately and continuously followed by all aerial robots, and ii) is within the field of view (FOV) of the cameras of all robots. The back-end of the system estimates the 3D pose and shape of the subject, using the images and other data acquired by the front-end [1]. The front-end poses a formation control problem for multiple MAVs. In this letter, we propose a deep neural network-based reinforcement learning (DRL) method for this formation control problem.
Authors are with the MPI for Intelligent Systems, Tübingen, Germany. {firstname.lastname}@tuebingen.mpg.de. The authors would like to thank Prof. Dr. Heinrich Bülthoff for his constant support and for providing us the access to the Vicon tracking hall in MPI for Biological Cybernetics. The authors also thank Igor Martinović and the anonymous reviewers for extremely helpful suggestions.
Below, we describe the drawbacks in state-of-the-art methods and highlight the novelties in our work to address them.
In existing solutions [1], [2], [3] the front and back end are developed independently: The formation control algorithms of the existing aerial MoCap front ends assume that the person should be centered in every MAV's camera image and she/he should be within a threshold distance to each MAV. These assumptions are intuitive and important. Also, it has been shown experimentally that they lead to a good MoCap estimate. However, the resulting formation remains sub-optimal without any feedback from the estimation back-end of the MoCap system. The estimated 3D pose and shape are strongly dependent on the viewpoints of the MAVs. In the current work, we take a learning-based approach to map and embed this dependency within the formation control algorithm. This is our first key novelty.
Existing approaches [2], [3], [4], [5] depend on tediously obtained system and observation models: State-of-the-art solutions to formation control problems involving perception-related objectives derive observation models for the robot's camera and the desired subject to compute real-time robot trajectories [2], [6], [7]. As these observation models are based on assumptions on the shape and motion of the subject, sensor noise and the system kinematics, the computed trajectories are sub-optimal. We overcome the aforementioned issue by addressing the formation control for aerial MoCap as a multi-agent reinforcement learning (RL) problem. This is the second key novelty of our approach. We let the MAVs learn the best control action given only the subject perception observable through the MAV's on-board camera images, without making any assumptions on the observation model.
The key insights which enable us to do this are i) the sequential decision making nature of the formation control problem with MoCap objectives, and ii) the feasibility of simulating control policies in synthetic environments. We leverage the actor-critic methodology of training an RL agent with a centralized training and decentralized execution paradigm. At test time, each agent runs a decentralized instance of the trained network in real-time. We showcase the performance of our method in several simulation experiments. We evaluate the quality of the generated robot trajectories using the pose and shape estimation algorithms in [8], [9] and [1]. Additionally, we compare our new approach with the state-of-the-art model-based controller from [2]. A demonstration and comparison with the method of [2] on a real MAV is also presented. Code and implementation details of our method are provided in the supplementary material.

II. RELATED WORK
Aerial Motion Capture Methodologies: A marker-based multi-robot aerial motion capture system is presented in [4]. Here, the poses of the person and the robots are jointly estimated and optimized online. A multi-robot model-predictive controller is used to compute trajectories which optimize the camera viewing angle and person visibility in the image. Marker-based methods suffer from tedious setup times, and optimal control methods for trajectory following can lead to sub-optimal policies for motion capture due to perceptual objectives. A markerless aerial motion capture system using multiple aerial robots and depth cameras is proposed by the authors in [10]. They use a non-rigid registration method to track and fuse the depth information from multiple flying cameras to jointly estimate the motion of a person and the cameras. Their approach works only indoors and the initial registration step can take a long time, similar to marker-based setups. In one of our previous works, [11], we introduced a vision-based (monocular RGB) markerless motion capture method using multiple aerial robots in outdoor scenarios. The pose and shape of the subject and the pose of the cameras are jointly estimated and optimized in [11]. While our other previous work [2] introduces a front-end of our outdoor aerial MoCap system, [11] describes the back-end.
Perception-Aware Optimal Control Methods for Target Tracking: In [6], a perception-aware MPC generates real-time motion plans which maximize the visibility of a desired static target. In [12] a deep learned optical flow algorithm and non-linear MPC are jointly utilized to optimize a general task-specific objective. The optical flow dynamics are explicitly embedded into the MPC to generate policies which ensure the visibility of target features during navigation. An occlusion-aware moving target following controller is proposed in [13]. Here, metrics for target visibility are utilized to navigate towards a moving target and constrained optimization is leveraged to navigate safely through corridors. In the above works, the motion plans are generated only for a single aerial robot to track a single generic target. In our previous work [2], a non-linear MPC based formation controller for active target perception is introduced for target following. The controller assumes Gaussian observation models and linearizes system dynamics. Using these, it identifies a collision-free trajectory which minimizes the fused uncertainty in target position estimates. In contrast to that, in our current work we learn a control policy to explicitly improve the quality of 3D reconstruction of human pose. An implicit perception-aware target following behavior evolves out of the controller for both single and multi-agent scenarios.
Learning based Control for Aerial Robots for Perception Driven Tasks: Optimal control methods are computationally expensive, require explicit estimation of the state of the system and world, and depend mostly on hand-crafted system and observation models. Thus, they can often lead to sub-optimal behaviors. A model-predictive control guided policy search was proposed in [14] where supervised learning is used to obtain policies which map the on-board aerial robot sensor observations to control actions. The method does not require explicit state estimation at test time and plans based on just input observations. In [15] the authors used a deep Q-learning based approach for cinematographic planning of an aerial robot (or MAV). A discrete action policy was trained on rewards that exploit aesthetic features in synthetic environments. User studies were performed to obtain the aesthetic criteria. In contrast to that, our current work proposes single and multi-agent MAV control policies that reward the minimization of errors in body pose and shape estimation. A proximal policy optimization (PPO) based distributed collision avoidance policy was proposed in [16]. A centralized training and decentralized execution paradigm was leveraged to obtain a policy that maps laser range scans to non-holonomic control actions. In [17] the authors propose an A3C actor-critic algorithm to develop reactive control actions in dynamic environments. Each agent's ego observations and LSTM-encoded dynamic environmental observations are inputs to a fully connected network. Their goal is to obtain a fully distributed control policy. In contrast to the aforementioned works, we propose a model-free deep reinforcement learning approach to the MoCap-aware MAV formation control problem. In our work, a policy neural network directly maps observations of the target subject to control actions of each MAV without any underlying assumptions of the observation model or system dynamics.

A. Problem Statement
Let there be a team of $K$ MAVs (with quadcopter-type dynamics) tracking a person $P$. The pose of the $k$-th MAV in the world frame at time $t$ is given by $(x^k_t, \Theta^k_t)$, where $x^k_t$ denotes the 3D position of the MAV's center in Cartesian coordinates and $\Theta^k_t$ denotes its orientation in Euler angles. Each MAV has an on-board, monocular, perspective camera. It is important to note that the camera is rigidly attached to the MAV's body frame, pitched down at an angle of $\theta_{cam}$. The global pose of the person is given by $(x^P_t, \Theta^P_t)$, where $x^P_t$ and $\Theta^P_t$ are the body's 3D center and global orientation, respectively. $x^P_{j,t}$ denotes the 3D position of a joint $j$ from a total of fourteen joints considered for the MoCap of the subject. Ground truth joints considered are visualized as circles in Fig. 2. The MAVs operate in an environment with neighboring MAVs as dynamic obstacles. Their task is to autonomously fly and record images of the person using their on-board camera. The formation control goal of the MAV team is to cooperatively navigate in a way such that the error in 3D pose estimates of the subject is minimized.

B. Formulation as a Sequential Decision Making Problem
Intuitively, the accuracy of aerial MoCap depends on the following two factors.
• The subject should always remain completely in the FOV of every MAV's camera, occupying the maximum possible area on the image plane.
• The subject is visually encapsulated from all possible directions (viewpoints).
Based on these intuitions and experimentally derived models for single and multiple camera-based observations, in our previous work [2] we approached this problem using a model predictive control (MPC) based formation controller. The MPC objective was to keep a threshold distance to the subject while satisfying constraints that enable uniform distribution of viewpoints around the subject. Additionally, a yaw controller ensured that the subject is always centered on the image plane. As discussed in the introduction, this method is hard to generalize because i) it is agnostic to how the 3D pose and shape was estimated by the back end, and ii) it needs carefully derived observation models.
To address these issues, in this work we take a deep reinforcement learning-based approach. We model this formation control problem as a sequential decision making problem for every MAV agent. Dropping the MAV superscript $k$, for each agent the problem is defined by the tuple $(S, O, A, T, R)$, where $S$ is the state-space, $O$ is the observation-space, $A$ is the action-space, $T$ is the environment transition model, and $R$ is the reward function. At each time instance $t$, an agent at state $s_t$ has access to an observation $o_t$ using its cameras and on-board sensors. The agent then chooses an action $a_t$, which is conditioned on $o_t$ using a stochastic policy $\pi_\theta(a_t \mid o_t)$. $\theta$ represents the parameters of a neural network. The agent experiences an instantaneous reward $r_t(s_t, a_t)$ from the environment indicating the goodness of the chosen action. We approach the problem without any underlying assumptions or knowledge about the environment transition model $T$. To this end, we leverage a model-free deep reinforcement learning method to train the agents. We will further describe the states, observations and actions in detail. For ease of notation and to keep the RL training computationally tractable, we will consider 2 MAV agents in this letter, i.e., $K = 2$. Rewards are described later when we discuss our proposed methodology in sub-section III-C.
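To make the role of the stochastic policy $\pi_\theta(a_t \mid o_t)$ concrete, the following is a minimal sketch of sampling one action from a diagonal Gaussian policy. The Gaussian form, the single linear layer standing in for the policy network, and all names and dimensions (`sample_action`, `obs_dim`, `act_dim`) are assumptions made for illustration; the text above only states that the policy is stochastic and parameterized by a neural network.

```python
import numpy as np

def sample_action(obs, weights, bias, log_std, rng):
    """Sample a_t ~ pi_theta(. | o_t) from a diagonal Gaussian policy.

    A single linear layer stands in for the policy network; the Gaussian
    form and every parameter name here are illustrative assumptions.
    """
    mean = weights @ obs + bias              # action mean, shape (act_dim,)
    std = np.exp(log_std)                    # strictly positive std devs
    return mean + std * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
obs_dim, act_dim = 11, 4                     # placeholder sizes
theta_w = np.zeros((act_dim, obs_dim))       # hypothetical parameters theta
theta_b = np.zeros(act_dim)
log_std = np.full(act_dim, -20.0)            # near-deterministic for the demo
action = sample_action(np.ones(obs_dim), theta_w, theta_b, log_std, rng)
```

During training a larger `log_std` would typically be used so that the agent explores; at test time one may act on the mean alone.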
The observation vector $o_t$ is given by (1). Its first two components are the measurements of the person $P$'s position and velocity made by the agent in its local Cartesian coordinates. This is given by $[y^P_t\ \dot{y}^P_t] \in \mathbb{R}^6$. The third component of the observation vector is the measurement of the relative yaw orientation of the person with respect to the robot's global yaw orientation, denoted by $\psi^P_t$. Here we emphasize that we make no assumptions regarding the uncertainty model associated with these measurements. However, we assume that these measurements are available using a vision-based detector or similar. In our synthetic training environment we directly use the available ground truth position and orientation of the person and the MAV to compute these measurements. In real robot scenarios we use Vicon readings to calculate them. The fourth component is the 3D position measurement of the neighboring MAV agent in the local Cartesian coordinates of the observing agent. This is given by $y^N_t \in \mathbb{R}^3$. The fifth component is the measurement of the relative yaw orientation of the person with respect to the neighboring robot's global yaw orientation, denoted by $\psi^{P,N}_t$.
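The five components described above can be assembled from ground-truth poses roughly as follows. This is a sketch under stated assumptions: the agent's local frame is taken to be the world frame translated to the agent and rotated by its yaw only, the person's velocity is rotated but not made relative to the agent's velocity, and the function names (`wrap_angle`, `to_local`, `build_observation`) are hypothetical.

```python
import numpy as np

def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + np.pi) % (2.0 * np.pi) - np.pi

def to_local(vec_world, yaw):
    """Rotate a world-frame vector into a frame yawed by `yaw` about z."""
    c, s = np.cos(yaw), np.sin(yaw)
    rot_world_to_local = np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])
    return rot_world_to_local @ vec_world

def build_observation(x_agent, yaw_agent, x_person, v_person, yaw_person,
                      x_neighbor, yaw_neighbor):
    """Stack [y^P, y^P_dot, psi^P, y^N, psi^{P,N}] into an 11-D vector."""
    y_p = to_local(x_person - x_agent, yaw_agent)      # person position
    y_p_dot = to_local(v_person, yaw_agent)            # person velocity
    psi_p = wrap_angle(yaw_person - yaw_agent)         # yaw w.r.t. agent
    y_n = to_local(x_neighbor - x_agent, yaw_agent)    # neighbor position
    psi_pn = wrap_angle(yaw_person - yaw_neighbor)     # yaw w.r.t. neighbor
    return np.concatenate([y_p, y_p_dot, [psi_p], y_n, [psi_pn]])

obs = build_observation(
    x_agent=np.zeros(3), yaw_agent=np.pi / 2,
    x_person=np.array([0.0, 1.0, 0.0]), v_person=np.zeros(3), yaw_person=np.pi,
    x_neighbor=np.array([1.0, 0.0, 0.0]), yaw_neighbor=0.0,
)
```

In the example the agent faces along the world y-axis, so a person one meter ahead along y appears at local coordinates (1, 0, 0).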
2) Actions: Action $a_t$ is sampled from the control policy $\pi_\theta(a_t \mid o_t)$ for an input observation $o_t$. In our formulation, actions consist of the egocentric 3D linear translational velocity of the agent, given by $v_t = [v^x_t\ v^y_t\ v^z_t]$, and a rotational velocity $\omega_t$ about its z-axis. The chosen action defines a way-point $\{x^w_t, \phi^w_t\}$ for the agent in the world frame (2), which is provided to the low-level geometric tracking controller (Lee controller) [18] of the agent. $x_t$, as defined before, denotes the current 3D position of the agent.
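One plausible way to turn such an action into a way-point is to integrate the commanded velocities over a short horizon. The sketch below assumes exactly that; the integration rule, the time step `dt`, the yaw-only rotation from the egocentric to the world frame, and the function name `action_to_waypoint` are all assumptions and are not taken from the letter.

```python
import numpy as np

def action_to_waypoint(x_t, yaw_t, action, dt=0.1):
    """Map a_t = [vx, vy, vz, omega] to a world-frame way-point.

    Assumed rule: x_w = x_t + R_z(yaw_t) v_t dt and phi_w = yaw_t + omega dt.
    """
    v_body = np.asarray(action[:3], dtype=float)
    omega = float(action[3])
    c, s = np.cos(yaw_t), np.sin(yaw_t)
    rot_body_to_world = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    x_w = np.asarray(x_t, dtype=float) + (rot_body_to_world @ v_body) * dt
    yaw_w = yaw_t + omega * dt
    return x_w, yaw_w

# Agent at 5 m altitude facing world +y, commanded 1 m/s forward for 2 s.
x_w, yaw_w = action_to_waypoint([0.0, 0.0, 5.0], np.pi / 2,
                                [1.0, 0.0, 0.0, 0.5], dt=2.0)
```

The resulting pair would then be handed to the low-level tracking controller.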

C. Proposed Methodology
Training multiple agents to achieve multiple objectives is a complex and computationally demanding task. In order to have a systematic comparison we first develop our approach for the single agent case and then for the multi-agent scenario. That is, we train (and then evaluate and compare) two different kinds of agents, and hence, networks. These are i) a single agent with only MoCap objectives, and ii) multiple agents (2 in our case) with both MoCap and collision avoidance objectives.
We hypothesize that using the first kind of network an agent will learn to follow the person and orient itself in the direction of the person in order to achieve accurate MoCap from the back-end estimator. On the other hand, using the second network, the agents will learn how to avoid each other and distribute themselves around the person to cover all possible viewpoints. We also hypothesize that the best navigation policies for the robot(s) for the MoCap task should significantly depend only on the MoCap's accuracy-related rewards, while other rewards may or may not be required.
1) Network 1: Single Agent Network: All variants of the single agent network use the states and observations given in (3), where the superscript 1 denotes the single agent network.
The actions for all single agent network variants consist of $a_t$ as stated in (2). They are all trained on a moving subject. These variants differ only in their reward structure, as described below. The rewards are computed at every timestep. However, for the sake of clarity we drop the subscript $t$ from the reward variables.
a) Network 1.1 - Only Centering Reward: In this variant we only reward the agent based on the intuitive reasoning of keeping the person as close as possible to the center of the image from the MAV agent's on-board camera. It is calculated as in (4),
where $d_{px}$ is the distance between the center of the person's bounding box on the image and the image center, measured in pixels. $c_1 = 0.01$ is a weighting constant. Note that keeping the person centered in each frame is not the goal of this work. As per the above-stated hypothesis, the centering reward may not be required at all. Thus, Network 1.1 will only serve as a comparison benchmark to highlight that a MoCap accuracy-related reward is explicitly required.
b) Network 1.2 - SPIN Reward: In this variant of the network we reward the agent based on the output accuracy of the MoCap back end. For this, we use SPIN [8], a state-of-the-art method for human pose and shape estimation using monocular images. At every time-step of training, we use SPIN on the image acquired by the agent and compute an estimate $\hat{x}^P_{j,t}\ \forall j;\ j = 1, \dots, 14$ corresponding to all 14 joints. In the synthetic training environment we have access to the true values of these joints, denoted by $x^P_{j,t}\ \forall j;\ j = 1, \dots, 14$. The SPIN reward is then given by (5), where $d_J = \frac{1}{14}\sum_{j=1}^{14} \|\hat{x}^P_{j,t} - x^P_{j,t}\|_2$ and $c_2 = 5$ is a weighting constant.
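The two distances that enter the centering and SPIN rewards can be computed as in the sketch below. It covers only $d_{px}$ and $d_J$ as defined in the text, not the mapping from these distances to reward values. The norm is read here as the Euclidean norm, which is consistent with MPE being reported in meters later in the letter; the bounding-box layout `(x_min, y_min, x_max, y_max)` and the function names are assumptions.

```python
import numpy as np

def centering_distance_px(bbox, image_size):
    """d_px: pixel distance from the bounding-box center to the image center.

    bbox = (x_min, y_min, x_max, y_max); image_size = (width, height).
    """
    box_center = np.array([(bbox[0] + bbox[2]) / 2.0,
                           (bbox[1] + bbox[3]) / 2.0])
    image_center = np.asarray(image_size, dtype=float) / 2.0
    return float(np.linalg.norm(box_center - image_center))

def mean_joint_error(joints_est, joints_true):
    """d_J: mean Euclidean distance over joints; both inputs shaped (14, 3)."""
    diff = np.asarray(joints_est, dtype=float) - np.asarray(joints_true, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

d_px = centering_distance_px((320, 240, 326, 248), (640, 480))
d_j = mean_joint_error(np.zeros((14, 3)), np.tile([3.0, 4.0, 0.0], (14, 1)))
```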
c) Network 1.3 - Weighted SPIN Reward: Network 1.2 rewards the agent equally for the accuracy of each joint. However, the joints further away from the pelvis (also referred to as the root joint), like hands or feet, have a greater tendency to be in erratic motion than the ones closer to the root, like hips. To account for this, in network variant 1.3 we penalize the outward joints more and hence define a Weighted SPIN reward as in (6), where $d_W = \frac{1}{14}\sum_{j=1}^{14} (w_j \|\hat{x}^P_{j,t} - x^P_{j,t}\|_2)$ and the $w_j$ are positive weights that sum to 1.
d) Network 1.4 - Centering and Weighted SPIN Reward: The last variant of the single agent network uses a summed reward given as $r_{sum} = r_{center} + r_{WSPIN}$.
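A short sketch of the weighted distance $d_W$ follows. The letter states only that the weights are positive, sum to 1 and penalize outward joints more; the particular weighting rule used here (proportional to one plus a joint's kinematic depth from the root) and the names `outward_weights` and `weighted_joint_error` are hypothetical.

```python
import numpy as np

def outward_weights(depth_from_root):
    """Hypothetical weights: grow with kinematic depth, normalized to sum to 1."""
    raw = 1.0 + np.asarray(depth_from_root, dtype=float)
    return raw / raw.sum()

def weighted_joint_error(joints_est, joints_true, weights):
    """d_W = (1/J) * sum_j w_j * ||x_hat_j - x_j||_2, with J = len(weights)."""
    weights = np.asarray(weights, dtype=float)
    if not (np.all(weights > 0) and np.isclose(weights.sum(), 1.0)):
        raise ValueError("weights must be positive and sum to 1")
    diff = np.asarray(joints_est, dtype=float) - np.asarray(joints_true, dtype=float)
    per_joint = np.linalg.norm(diff, axis=1)
    return float(np.sum(weights * per_joint) / len(weights))

w = outward_weights(np.arange(14) % 4)       # placeholder joint depths
d_w = weighted_joint_error(np.zeros((14, 3)),
                           np.tile([3.0, 4.0, 0.0], (14, 1)), w)
```

Because the weights sum to 1 and are then divided by the number of joints, $d_W$ is smaller than $d_J$ for the same per-joint errors, so the two are not directly comparable in magnitude.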
2) Network 2: Multi-Agent Network: All three variants of the multi agent network, described below, use the state as defined in (1). The observations for Network variants 2.1 and 2.2 are equal to (1) without $\dot{y}^P_t$, as these variants are trained on a static subject. In these two variants the action space excludes yaw control. Hence during their training, we use a separate yaw controller to always orient the agent towards the person. On the other hand, Network 2.3 is trained with the full observation space as stated in (1) on a moving subject, and it uses the full action space as stated in (2). That is, Network 2.3 also includes yaw-rate control.
The difference in the reward structure is described below.
a) Network 2.1: Centering, collision avoidance and AlphaPose Triangulation Reward (Trained with Static Subject): In this variant we use a sum of three rewards $r_{center}$, $r_{col}$ and $r_{triag}$. Here, $r_{center}$ is the same as defined in (4). $r_{col}$ rewards avoiding collisions by penalizing based on the distance from the neighboring robot. It is computed as in (7), where $x_{thresh} = 3$ m in our implementation. $r_{triag}$ is a simplified MoCap-specific reward in a 2-agent scenario, which we obtain using a triangulation-based method. AlphaPose [19] is a state-of-the-art human joint detector which provides body joint detections on monocular images. At every time step we use it on the images obtained by the agent and its neighbor to obtain $o_{j,t} \in \mathbb{R}^{14}$ and $\bar{o}_{j,t} \in \mathbb{R}^{14}$, respectively. Using known camera intrinsics and extrinsics (from self-pose estimates) for both agents, a point in the image plane and its corresponding view from another camera, we can estimate the 3D position of the point using a least squares formulation (equation (14.42) in [20]). Therefore, by using $o_{j,t}$ and $\bar{o}_{j,t}$, we estimate the 3D positions of all 14 joints of the subject as $\hat{x}^P_{j,t}\ \forall j;\ j = 1, \dots, 14$ and compare them to the ground-truth joint positions $x^P_{j,t}\ \forall j;\ j = 1, \dots, 14$. Thus, $r_{triag}$ is given by (8), where $d_{triag} = \frac{1}{14}\sum_{j=1}^{14} \|\hat{x}^P_{j,t} - x^P_{j,t}\|_2$.
b) Network 2.2: Centering, collision avoidance and Multiview HMR Reward (Trained with Static Subject): In this variant we use a sum of three rewards $r_{center}$, $r_{col}$ and $r_{MHMR}$. The first two are the same as (4) and (7), respectively. $r_{MHMR}$ rewards the agent based on the output accuracy of the MoCap back end using images from multiple agents. For this, we use MultiviewHMR [9]. It is a state-of-the-art method for human pose and shape estimation using images from multiple viewpoints.
At every timestep of training, we use it on the images acquired by the agent and its neighbor to compute an estimate $\hat{x}^P_{j,t}\ \forall j;\ j = 1, \dots, 14$ corresponding to all 14 joints. The reward is then given by (9), where $d_{mhmr} = \frac{1}{14}\sum_{j=1}^{14} (w_j \|\hat{x}^P_{j,t} - x^P_{j,t}\|_2)$ and the weights are as described in the previous section.
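For the triangulation-based reward of Network 2.1, a two-view least-squares estimate of a single joint can be sketched as follows. This uses the standard linear (DLT-style) triangulation solved by SVD, which may differ in form from the cited equation in [20]. The projection matrices, the example camera parameters and the function names are assumptions for illustration.

```python
import numpy as np

def triangulate_point(P1, P2, uv1, uv2):
    """Linear least-squares triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices; uv1, uv2: pixel detections (u, v).
    """
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The homogeneous least-squares solution is the right singular vector
    # associated with the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point to pixel coordinates with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two hypothetical cameras with identical intrinsics, 1 m apart along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.3, -0.2, 4.0])
X_est = triangulate_point(P1, P2, project(P1, X_true), project(P2, X_true))
```

Applying this per joint to the detections from both agents and averaging the distances to the ground-truth joints gives $d_{triag}$ as defined above. With noisy detections the estimate degrades as the baseline between the two MAVs shrinks.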
c) Network 2.3: Centering, continuous collision avoidance and Multiview HMR Reward (Trained with Moving Subject): In this variant we use a sum of three rewards $r_{center}$, $r_{concol}$ and $r_{MHMR}$. Here $r_{center}$ and $r_{MHMR}$ are the same as (4) and (9). The continuous collision avoidance reward is given by (10),
where $d_{lthresh} = 1.0$ m and $d_{hthresh} = 20$ m. $v_{pot}$ is obtained using the potential field functions as described in our previous work [21] (equation 3). Furthermore, the value of $v_{pot}$ is clamped to 1.
d) Network 2.4 + Potential Field: Centering and Multiview HMR Reward (Trained with Moving Subject): In this variant, we use a sum of two rewards, namely, $r_{center}$ (4) and $r_{MHMR}$ (9). The key difference in this case w.r.t. Network 2.3 is that here we use a potential field-based collision avoidance method [21] as a part of the environment during training to keep the robots from colliding with each other at all times. It is not embedded in the reward structure and hence, the robots are not explicitly penalized for collisions. Testing of this network, during experiments, was also performed with potential field-based collision avoidance as a part of the environment.

1) Training Setup in Simulation:
We train our networks in simulation. We use the Gazebo multi-body dynamics simulator with ROS and OpenAI-Gym to train the MAV agents. For the MAV agent we use the AscTec Firefly model with an on-board RGB camera facing down at a 45° pitch angle w.r.t. the MAV body frame. We run 5 parallel instances of Gazebo and the AlphaPose network on multiple computers over a network to render the simulation. The policy network is trained on a dedicated PC which samples a batch of transition and reward tuples from the network of computers to update the networks. We use a simulated human in Gazebo as the MoCap subject and generate random trajectories using a custom plugin. Details of the network architectures, training process, libraries, instructions on how to run the code, etc., are provided in the attached supplementary material.
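The abstract names PPO as the training algorithm. As a reminder of what the policy update optimizes on each sampled batch, the following sketches the standard PPO clipped surrogate loss in NumPy. It is a generic illustration, not the authors' training code; the clipping value 0.2 is a commonly used default and is not taken from the letter, whose actual hyperparameters are in the supplementary material.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative PPO clipped surrogate objective over a batch of transitions.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the data-collecting policy; advantages: advantage estimates.
    """
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Taking the elementwise minimum removes the incentive to move the
    # policy ratio outside the clipping interval.
    return float(-np.mean(np.minimum(unclipped, clipped)))

loss_same_policy = ppo_clip_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, -1.0, 2.0])
loss_clipped = ppo_clip_loss([np.log(2.0)], [0.0], [1.0])
```

In the second call the probability ratio is 2, so the positive advantage is credited only up to the clipped ratio of 1.2.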

A. Simulation results
In this sub-section we evaluate our trained policies in the Gazebo simulation environment. We create a 120 s test trajectory for the simulated human actor, on which it walks with varying speeds. The best policy of each network variant, as described in sub-section III-C, is run 20 times while the actor walks the trajectory. Thus, results from a total of 2400 s of evaluation runs are obtained for each network variant.
For single agent experiments, in addition to the DRL-based methods, we run 4 other methods: i) 'Network 1.4 + AirCap', ii) Orbiting Strategy, iii) Frontal-view Strategy and iv) MPC-based approach [2]. For multi-agent experiments we run 2 additional methods: i) 'Network 2.3 + AirCap' and ii) MPC-based approach [2]. All these were also run 20 times for 120 s each to allow comparison with our DRL-based policies. 'Network 1.4 + AirCap' and 'Network 2.3 + AirCap' imply running the networks with 'true observations' instead of directly using simulator-generated ground-truth observations. To this end, we ran the complete AirCap pipeline [2] during the test by replacing only the MPC-based high-level controller with the DRL policy in it. It executes an NN-based person detector, a Kalman filter-based estimator for the person's 3D position (not orientation), and cooperative self-localization of the MAVs using simulated GPS measurements with noise as well as communication packet loss. More details regarding this are provided in the supplementary material associated with this article. 'Orbiting Strategy' is essentially a 'model-free' approach in which a robot orbits around the person at a fixed distance in order to increase the coverage. In 'Frontal-View Strategy' a robot maintains a fixed distance to the person and attempts to always keep the frontal view of the person in the camera image. Below we discuss the results for single and multi-agent network variants and other aforementioned methods.

1) Single Agent Network Variants:
In order to compare the network variants, we use 2 metrics, i) centering performance error (CPE) and ii) MoCap performance error (MPE). CPE is computed as the pixel distance from the center of the bounding box around the person in the agent's camera image to the image center. MPE, for single agent networks, is simply $d_J$, as defined for the reward in (5). To compute this, the SPIN method [8] is run on the images acquired by the agents during testing. Note that the metric which quantifies the MoCap accuracy of any method in this paper is MPE (the right side box plots in Figs. 5 and 6). CPE is a metric that we plot only to make the policy performance intuitively explainable and understand 'what' the learned RL policies are doing to achieve a good MPE. Figure 5 shows the error statistics of the aforementioned metrics. The grey background behind any box plot signifies that the method could not keep the person, even partially, in the MAV FOV, thereby completely losing him/her, for at least some duration of the experiment runs. In these cases, the box plot represents errors computed only for those timesteps when the person was at least partially in the FOV.
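The evaluation rule described above, where errors are summarized only over timesteps in which the person is at least partially visible, can be sketched as follows. The function name `summarize_errors`, the returned fields and the sample numbers are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def summarize_errors(errors, visible):
    """Summarize a per-timestep error (CPE or MPE) over visible timesteps only.

    errors: error value at every timestep; visible: boolean flag per timestep
    indicating that the person was at least partially in the camera FOV.
    """
    errors = np.asarray(errors, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    kept = errors[visible]                   # drop timesteps without the person
    return {
        "visible_fraction": float(visible.mean()),
        "median": float(np.median(kept)) if kept.size else float("nan"),
        "variance": float(np.var(kept)) if kept.size else float("nan"),
    }

stats = summarize_errors([0.5, 0.7, 9.9, 0.9], [True, True, False, True])
```

Reporting the visible fraction alongside the median matters, since a method that frequently loses the person can still show a low error on the frames it does see.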
The MPE plots in Fig. 5 for the single robot experiments show that the medians of the MPEs are very similar across all methods. This is the most significant result, especially because we can demonstrate that in terms of accuracy our DRL-based approach is on par with the state-of-the-art MPC-based approach [2] (or fixed-strategy methods), without the need for hand-crafting observation models and system dynamics (or pre-specified robot trajectories). Furthermore, the MPEs of Network 1.4 and Network 1.2 have significantly less variance than those of all other methods. For these reasons, Network 1.4 and Network 1.2 are the two most successful approaches for the MoCap task.
From Fig. 5 plots, we also see that Network 1.4 keeps the person centered much more than Network 1.2, 1.3 or MPC. This is expected because Network 1.4 is rewarded for centering the person in the image in addition to SPIN-based MoCap rewards. Network 1.2 or 1.3, on the other hand, only has SPIN-based MoCap rewards. Nevertheless, the MPE of Network 1.4 is only slightly better than that of Network 1.3. This signifies that centering the person in the image does not have a great impact on the accuracy of the motion capture estimates. Network 1.1, which often lost the person in its FOV, outperforms all other methods in its CPE performance for the duration it could 'see' the person. This is expected as it is trained with only centering reward. Even though its MPE mean for the person-visible duration is similar to other networks, the variance of its MPE is higher than the other networks. Moreover, the fact that it could keep the person in FOV only 76% of the time as compared to 100% for other networks (1.2-1.4) makes it less desirable even for the MoCap task.
The median MPE of 'Network 1.4 + AirCap' is very similar to all other methods. However, it should be noted that there is one drawback in 'Network 1.4 + AirCap'. As the 'ground truth observations' are not used in this method and the simulated person can rapidly make sudden direction changes, the person is much more susceptible to go out of the FOV of the MAV's camera. Since the network never learned to 'search' for the person who is out of the FOV, the method has to 'wait' until the person walks back in the FOV. The cooperative estimation method of the AirCap pipeline helps in this regard as the person might still be in another robot's FOV. For a single robot case this is also not possible. Thus, 'Network 1.4 + AirCap' loses the person for 35% of the time.
The strategy-based methods struggle to keep the person, even partially, in the MAV camera's FOV. While the 'Orbiting Strategy' was able to keep the person in the FOV for 73% of the total time of all experiments combined, the 'Frontal-View Strategy' managed to do that only 20% of the total time. This is because when the person changes his direction or speed of motion, the robot could fly around to reposition itself in the front of the person, thus losing him during the transition. On the other hand, our successful DRL-based approaches, i.e., Network 1.2, 1.3 and 1.4, never lose the person from the camera FOV. Based on this analysis, we can conclude that the strategy-based methods, while being 'model-free', still have a major drawback of losing the person often, if not very carefully hand-crafted. Our DRL-based approaches 'explore' the space of these strategies and find the most suitable one in their policies.
2) Multi-Agent Network Variants: The MPE in the multi-agent case is also simply $d_J$, as defined for the reward in (5), but instead of using SPIN as in the single agent case, here it is computed by running Multiview HMR [9] for pose and shape estimation on every simultaneous pair of images acquired by both the agents during the evaluation runs. Network 2.1 and 2.2 were trained and tested on a static person. On the other hand, Network 2.3 and Network 2.4 + Potential Field were both trained and tested with a moving person (in the same way as for the single agent experiments). The remaining two methods in the multi-agent case were also tested with moving persons. Figure 6 shows the error statistics of the multi-agent simulation experiments. The best performing network in the multi-agent case is Network 2.3. It is very similar to the MPC-based method in terms of the MPE median value (see Fig. 6, right side) and has much less MPE variance than MPC. This is a very significant result as MPC required observation models of the subject and our DRL-based approach in Network 2.3 did not. In the MPC approach, the viewpoint configurations for the MAVs emerge out of the joint target perception models. In contrast, in the DRL-based approach the MAVs directly learn the viewpoint configurations from experience. We also notice that the rewards based on a triangulation method assist, to some extent, in achieving acceptable MoCap performance (see results of Network 2.1). However, they remain inferior to Network 2.3, which used the sophisticated approach taken in Multiview HMR [9] for reward computation.
Furthermore, we find that in terms of MPE, 'Network 2.3 + AirCap' is close to both Network 2.3 and MPC. Similar to 'Network 1.4 + AirCap', 'Network 2.3 + AirCap' also loses the person from the robots' FOV. However, the person is present in at least one robot's FOV for approx. 97% of the total experiment duration. The increased visibility in the multi-robot case is due to the cooperative estimator module of the AirCap pipeline. This assessment indicates the usability of our method on real robots with real observations. Next, we find that the policy learned by 'Network 2.4 + Potential Field' was able to achieve an MPE median value comparable to Network 2.3, but at the cost of slightly higher MPE variance and loss of the person from at least one robot's FOV for several periods (13% of the total duration). This experiment further highlights the key benefit of our DRL-based approach in Network 2.3: it overcomes the need for known models, strategies and any ad-hoc collision avoidance techniques. In Network 2.3 the learned policy not only achieves good MoCap performance, but also naturally learns to avoid collisions with the teammates. In the video associated with this paper (also available here: https://youtu.be/07KwNjc7Sy0) we show how well Network 2.3 performs. The networks for the moving person, however, did not ensure very good centering of the person in the image (see the left side of Fig. 6) as compared to the MPC-based approach. Despite this, their MPE performance is only slightly poorer than that of MPC (the MPE median difference is only approx. 0.05 m). This further indicates that centering the person in the image has very little effect on MoCap performance.
Finally, for the multi-agent case, we find that the medians of the MPEs for all multi-agent networks were substantially lower than the MPEs obtained in the single-drone experiments (from ∼0.7 m to 0.22 m). This highlights the benefit of using multiple drones, and hence multiple views, to improve MoCap performance.

B. Real Robot results
In order to validate our approach in a real robot scenario, we used a DJI Ryze Tello drone. It is equipped with a forward-looking camera capturing images at 30 Hz. The drone is controllable using an SDK with a ROS interface. The Tello's built-in vision-based localization is highly inaccurate. Hence, we performed experiments within a Vicon hall with markers on top of the drone to estimate its position and velocity. The tracked subject wore a helmet with Vicon markers. The Vicon-based position estimate of the person was used to compute the observations for the neural network.
We performed experiments with one Tello drone and compared our DRL-based approach using Network 1.1 with the state-of-the-art MPC-based approach [2]. These experiments ran for approximately 400 s and 700 s, respectively. Figure 7 shows external camera footage of the experiment and the on-board drone view with pose and shape overlay using SPIN. As the ground truth pose and shape of the human subject are not available in the real experiment, we compare only the following criteria: i) the length and breadth of the bounding box around the person in the drone images, and ii) the proximity of the person to the center of those images, calculated as the pixel distance from the image center to the center of the bounding box around the person. The bounding boxes are computed by running the AlphaPose [19] method on the images recorded by the drone. Figure 8 presents the statistics of these evaluation criteria. We notice that the performance of both approaches is similar in terms of the person's proximity to the image center, with our DRL-based approach performing slightly better. However, we observe that the MPC-based approach is consistently able to keep a larger size (projected height) of the person in the images. This is due to the fact that the MPC's objectives force it to maintain a certain threshold distance to the person. As the DRL-based approach has no such incentive, it varies its distance to the person more, therefore causing a greater variance in the projected height of the person. On the other hand, this enables our DRL-based approach to change its relative orientation with respect to the person such that she/he is observed from several possible sides. This is evident from the greater variance in the projected width of the person in the images. This property of our DRL-based approach should benefit pose and shape estimation methods, as demonstrated in the simulation experiments.
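The two image-based criteria above reduce to simple geometry on the detected bounding box. The following is a minimal sketch, assuming axis-aligned boxes given in pixel coordinates as (x_min, y_min, x_max, y_max). The type and function names, and the 960 x 720 image size used in the example, are assumptions made for illustration; the person detector (e.g., AlphaPose) is treated as an external component that supplies the box.

```python
import math
from typing import NamedTuple, Tuple


class BoundingBox(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical representation)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def box_size(box: BoundingBox) -> Tuple[float, float]:
    """Return (projected width, projected height) of the person in pixels."""
    return box.x_max - box.x_min, box.y_max - box.y_min


def center_offset(box: BoundingBox, image_width: int, image_height: int) -> float:
    """Pixel distance from the image center to the bounding-box center."""
    box_cx = (box.x_min + box.x_max) / 2.0
    box_cy = (box.y_min + box.y_max) / 2.0
    return math.hypot(box_cx - image_width / 2.0, box_cy - image_height / 2.0)


if __name__ == "__main__":
    # Toy detection in an assumed 960 x 720 image, not experimental data.
    box = BoundingBox(473.0, 354.0, 493.0, 374.0)
    print(box_size(box))
    print(center_offset(box, 960, 720))
```

Collecting these values over all frames of a run gives the distributions of projected width, projected height and centering distance that are compared between the two controllers.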

V. CONCLUSIONS AND FUTURE WORK
In this letter, we presented the first deep reinforcement learning-based approach to human motion capture using aerial robots. Our solution does not depend on hand-crafted system or observation models. Formation control policies are directly learned through experience, which is obtained in synthetic training environments. Through extensive experiments and comparisons we find that single agents learn extremely good policies, on par with carefully designed model-based or model-free methods. They even generalize to real robot scenarios. We also find that multiple agents learn even better policies and outperform single agents in performing MoCap. Our approach would also be applicable in a real robot setting with 'real observations' while achieving accuracy similar to an MPC-based approach [2]. Nevertheless, this is valid only for those durations when the person is not lost from the FOV of all cameras. In order for the policy to 'search' for the person, network training should be done with the AirCap pipeline's 'real observations'. This would involve massive parallelization, running several DNN-based detectors and keeping track of delayed measurements. Furthermore, our approach is limited in terms of scaling up to more agents. While addressing this will require a more sophisticated network architecture, it should be noted that 2 to 3 aerial robots may be enough to achieve good MoCap accuracy, as shown in one of our recent works [1]. We also intend to improve the training process by using photorealistic scenes and body models in richer environments.