Vision-Based Formation Control for an Outdoor UAV Swarm With Hierarchical Architecture

Formation control of a UAV swarm is challenging in outdoor GNSS-denied environments due to the difficulty of accomplishing relative positioning among the UAVs. This study proposes a vision-based formation control strategy that can be implemented in the absence of an external positioning system. A hierarchical architecture is constructed for the UAV swarm using a modified leader-follower strategy. The leader UAV derives and broadcasts the locations of the follower UAVs, while the follower UAVs calculate their control inputs to achieve the desired swarm formation. The vision-based localization of the UAVs is accomplished using state-of-the-art deep learning algorithms, namely YOLOv7 and DeepSORT. An RflySim-based simulation verifies the feasibility of the concept, and a real flight test with a swarm of five quadrotors validates it. Results show the robustness of the vision-based UAV positioning framework, with a localization error within 0.3 m. Moreover, formation control without GNSS is achieved with a monocular camera and entry-level AI platforms implemented onboard, which makes the approach readily extensible to broader UAV swarm applications.


I. INTRODUCTION
A UAV swarm is a system composed of multiple autonomous UAVs that can be used to accomplish collective tasks such as combat operations [1], search and rescue [2], smart agriculture [3], etc. [4]. These tasks can be carried out more effectively by the UAVs as a group than by each individual UAV alone. The superior capabilities of a UAV swarm, including extended scalability, flexibility, and robustness [5], are often referred to as swarm intelligence. Formation control is one of the important research topics for achieving swarm intelligence [6]. For swarm systems with a large number of UAVs, realizing desired formations in a robust way is of both theoretical and engineering significance, and many control strategies have been proposed [7].
In order to autonomously construct the desired formation, a swarm of UAVs depends heavily on a Global Navigation Satellite System (GNSS), such as GPS or Beidou, to maintain relative locations in the formation. However, the GNSS signal can be affected by environmental conditions or obscured by suppressive jamming [8]. These situations not only disable formation maintenance but also leave individual UAVs unable to navigate during flight. Therefore, an auxiliary or redundant localization approach is needed to ensure UAV formation control when GNSS signals cannot be received, which constitutes the main scope of this article.
Current successful positioning systems other than GNSS for UAV swarms include collaborative simultaneous localization and mapping (C-SLAM) [9], Ultra Wide Band (UWB) [10], [11], motion capture (MoCap) systems [12], and many others [13]. However, most of these systems are intended for indoor applications. When extending them to outdoor applications, the larger flight altitudes and areas of outdoor UAV swarms demand substantial upgrades of the corresponding sensors and infrastructure, which can be impractical considering weight, cost, and power consumption. As a result, new frameworks for lighter, cheaper, onboard localization solutions applicable to outdoor UAV swarms are of great importance.
Providing vision-based solutions to the localization problem for formation control of an outdoor swarm of small UAVs is challenging. The difficulties are mainly threefold. First, outdoor UAVs usually appear as small targets when observed from a distance, and the changeable background influenced by daylight, landform, and vegetation tends to undermine successful detection. Second, UAVs in a swarm usually have similar appearances that are hard to differentiate, which can cause confusion when tracking them visually. Third, outdoor applications prefer the vision-based framework to be implemented onboard, which requires sensors that are operable on small UAVs.
In this study, a vision-based formation control strategy is proposed for outdoor small-UAV swarms. As shown in Fig.1, a heterogeneous swarm with a hierarchical architecture is constructed. One leader UAV is selected to visually detect, track, and localize multiple follower UAVs within the visually monitored area using a downward-facing monocular camera. The follower UAVs receive their relative locations broadcast by the centralized leader UAV and perform formation control in a decentralized way. The ground control station can send commands to the swarm through the datalink, and the leader UAV can send its sensed data back. The contributions of this work are summarized as follows:
1) A vision-based localization strategy for UAV swarms in outdoor scenarios is proposed, which applies to GNSS-denied conditions. State-of-the-art deep learning algorithms, YOLOv7 and DeepSORT, are adopted for the visual detection and tracking of multiple drones. The positioning information for individual UAVs is then derived and distributed among the swarm for formation control. The precision of the visual localization is analyzed in the simulation environment under different scenarios.
2) A heterogeneous swarm with a hierarchical architecture is developed and validated in a real flight. The leader UAV with a monocular camera is allocated the task of providing localization information for the follower UAVs, while the follower UAVs achieve the designated formation with their individual control algorithms when their GNSS is disabled. The swarm formation control strategy in this study is centralized in its positioning mechanism and decentralized in its control mechanism.
3) The vision-based formation control framework is implemented onboard with entry-level AI platforms, where the deep learning algorithms produce location information in real time. Instead of equipping every UAV in the swarm with visual sensors, only one monocular camera mounted on the leader UAV is required. This shows that the proposed framework can be generalized to broader situations at a relatively low cost.
The remainder of the article is arranged as follows. Section II summarizes the related work. Section III introduces the methodology: the general framework is given first, and the subsections on detection, tracking, localization, and the formation control strategy are then presented in sequence. Section IV describes the implementation details of the simulation, the experiment, and the data preparation. Results from the simulation and the experiment are presented in Section V. Discussion is given in Section VI. Finally, the conclusions are summarized in Section VII.

II. RELATED WORK
A. VISUAL DETECTION OF OUTDOOR UAVs
Visual detection of UAVs is a specific application of the object detection technique in computer vision (CV). Zou et al. [14] classified the object detectors of the last two decades into traditional detectors and Deep Learning (DL) based detectors. For instance, [15] used traditional detectors for outdoor UAV detection. The major limitation of traditional detectors is their limited capability to process large volumes of image data, whereas the visual features of distant outdoor UAVs can only be captured well in real-time videos or image streams of high resolution. Although limited in capability, traditional detectors inspired the design of more advanced DL detectors.
DL has been an overwhelmingly successful technique in CV, and has already demonstrated its accuracy and reliability in vision-based applications such as UAV localization [16], UAV remote sensing [17], UAV autonomous navigation [18], etc. It takes advantage of diversified convolutional neural network (CNN) structures to abstract visual features from big data automatically and efficiently [19]. Moreover, DL has a powerful ability to generalize [20], making it more adaptable to changing background scenes once the UAV features have been properly learned.
The earliest DL-based detectors were two-stage detectors, since they generate potential bounding boxes and run a classifier on these proposed boxes in separate steps. Studies using such detectors for UAV detection include [21] and [22]. In contrast, the more recent one-stage DL detectors frame object detection as a regression problem so that the two stages are completed simultaneously. Many studies use one-stage detectors such as YOLO [23] and CenterNet [24] for UAV detection. These detectors have been upgraded so rapidly that detection performance has improved substantially. This study utilizes the state-of-the-art DL detector YOLOv7 [25] for UAV detection in outdoor swarm applications, and satisfactory results are achieved.

B. VISION-BASED TRACKING OF MULTIPLE UAVs
The multi-object tracking (MOT) or multi-target tracking (MTT) problem is another hot topic in DL-based CV [26]. UAVs in the same swarm usually share a similar visual appearance, and tracking them simultaneously is a tough task due to the various occlusions and interactions among them during formation construction. Previous studies [27], [28], [29], [30] used artificial markers to tag different UAVs so that each UAV has a unique visual feature by which to be identified. UAV tag systems with artificial markers have been established using circular patterns composed of concentric colored circles [27], blinking ultraviolet markers with encoded patterns [28], continuous and pulsating lights [29], and an active infrared LED target with a coded arrangement [30]. It turns out that these tag systems can not only identify each UAV, but can also be used to estimate a UAV's attitude and distance when properly calibrated.
Although artificial markers are successful for tracking multiple similar UAVs, their limitations in specific scenarios are not negligible. First, installing additional tag hardware may be impractical, especially for small UAVs with limited spatial capacity. In addition, tag systems that emit light tend to become easy targets in hostile environments and cannot be used in stealthy tasks. As a result, visual tracking of multiple markerless UAVs is of considerable importance.
Recent studies solve the MOT/MTT problem for multiple markerless UAVs using classic filters such as the Kalman filter [31], the JPDAF (joint probabilistic data association filter) [31], [32], and the PHD (Probability Hypothesis Density) filter [31], [33], [34]. These filters are good at alleviating errors due to sensing noise and false positive detections to some extent, but are largely ineffective against false negatives [34]. At the same time, DL-based MOT/MTT solutions have made great progress in successfully tracking large numbers of pedestrians or vehicles [26]. This is especially true for tracking-by-detection methods (where object detection is carried out separately before tracking), such as SORT [35] and DeepSORT [36], which have already been used in UAV tracking applications [37], [38]. In this study, we take advantage of these state-of-the-art DL tools, the combination of YOLOv7 and DeepSORT, to explore a robust vision-based framework for UAV swarm formation control.

C. ONBOARD VISION-BASED FORMATION CONTROL FOR A SMALL UAV SWARM
As a disruptive technology, the UAV swarm has evolved with extensive diversity, varying from centralized to decentralized (depending on whether a global coordinator exists in the communication architecture) [39]. Previous studies implementing vision-based formation control usually assume that GNSS and even radio signals are disabled, in which case every UAV in the swarm formation must be self-sufficient in visual localization. Consequently, this assumption requires each UAV to be equipped with homogeneous sensors and equally powerful processors for visual data processing. The practical problem is that, in order to observe all neighboring UAVs, only omnidirectional or fish-eye cameras are suitable to be mounted on each UAV, which is difficult especially for small UAVs with limited payload. In addition, detecting and tracking multiple UAVs in different directions is computation-demanding, and current AI platforms can hardly support such demanding algorithms when the number of UAVs is large. Finally and most importantly, complicated formation construction requires the whole swarm to reach a consensus, which is almost impossible without an effective communication mechanism among the swarm.
In this study, we relax the rigorous assumption of the absence of radio communications, making the swarm centralized in its positioning system but decentralized in its formation controller distribution. As shown in Fig.1, the swarm is heterogeneous: only the leader UAV is enabled to visually detect and track the other UAVs. The follower UAVs receive their relative locations and perform formation control in a decentralized way. A similar topology can be found in [40], while the current study focuses mainly on putting the vision-based formation control framework into practice. Although the acquisition of situational information is not enabled for the follower UAVs in this study, various sensors could be mounted on them to achieve extended reconnaissance capability in future applications.

III. METHODOLOGY
A. OVERVIEW
In this study, we construct a heterogeneous swarm where one leader UAV acts as the coordinator for the rest of the follower UAVs. Fig.2 presents an overview of the proposed vision-based framework for formation control. The leader UAV mainly comprises a flight controller, multiple sensors, an image processor, a datalink, and a broadcaster.
The visual sensor can be a monocular camera, and the leader UAV can utilize vision-based localization strategies such as feature point matching, template matching, visual odometry, and deep learning to determine its position without GNSS. The sensors on the leader UAV not only support autonomous navigation but also frequently monitor the area of interest and the formation of the follower UAVs. The image processor handles the detection, tracking, and localization tasks in real time and distributes the locations to the rest of the UAVs through the broadcaster. The ground station has two-way communication with the leader UAV through the datalink, and its commands can also be relayed to the follower UAVs by the leader UAV in a simplex way. In contrast, each follower UAV carries a flight controller, indispensable sensors, a formation controller, and a receiver. Compared with the leader UAV, the follower UAVs do not need cameras or image processors onboard unless required by the task. After receiving the locations of all the follower UAVs, each computes its own formation control input with its onboard formation controller. They only need basic sensors, such as an altimeter, a compass, and an inertial measurement unit (IMU), to execute the formation control commands.
The hierarchical swarm actually has a centralized framework, which is vulnerable to overall incapacitation if the leader UAV malfunctions. Three solutions could be applied to handle this emergency situation. Firstly, a new leader UAV can be reallocated from among the follower UAVs. This solution requires preparing some follower UAVs as backup leaders and reallocating their missions as soon as the original leader loses its capability; dynamic task reallocation for this scenario can be found in Tang et al. [41]. Secondly, multiple leaders could be used to solve the problem from the perspective of redundant design. In this case, as long as one leader UAV remains on duty, the swarm can continue to operate; a similar idea can be found in Wang et al. [42]. Thirdly, the follower UAVs could be equipped with vision-based redundant navigation systems. Should the leader UAV fail to broadcast the positioning information, the follower UAVs could return to base on their own using vision-based strategies, as mentioned by Warren et al. [43].
The algorithm architecture is presented in Fig.3 with four major modules: detection, tracking, localization, and formation control. The input is the video stream of the leader UAV and the output is the velocity command generated by the follower UAVs. This module division is consistent with the titles of the following subsections, which can be regarded as a summary of the methodology of this study. The proposed heterogeneous framework and algorithm architecture are notably simple, since the follower UAVs need to carry much less payload onboard; the formation controller can even be coded together with the flight controller. Compared with the framework in [40], the consensus control of the follower UAVs is now handled by the centralized leader UAV, so communication among the follower UAVs becomes dispensable. In the rest of this section, we present the details of this framework by focusing on the modules in the image processor on the leader UAV and the formation controller embedded in the follower UAVs.

B. DETECTION
YOLO (You Only Look Once) is a state-of-the-art object detection algorithm that has been continuously optimized by the DL-based object detection community. This study uses the recent version YOLOv7 [25] as the detector for the follower UAVs. For the development from the original YOLO to YOLOv7, [44] provides a detailed review.
Details of the YOLOv7 network structure can be found in [25] and [45]. YOLOv7 mostly inherits from YOLOv5 [46], including the overall network architecture, the configuration file settings, and the training, inference, and verification processes. YOLOv7 introduces the E-ELAN (Extended-Efficient Layer Aggregation Networks) module into the network structure: it uses convolution to expand the feature base, then completes a feature shuffle operation, and finally merges the output feature maps of different convolutional layers to improve the network's ability to extract more diversified image features. At the same time, YOLOv7 uses model re-parameterization to fuse the convolutional layers with the batch normalization layers; they are merged into one module for inference, improving inference speed while retaining detection performance [25]. YOLOv7 also uses deep supervision for deep network training, adding auxiliary heads in the middle layers of the network, while the weight parameters used only for loss optimization are excluded from inference. By allowing the auxiliary head to learn directly from the image, the main network can pay more attention to new features, and thus the learning ability of the network is improved.
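As a concrete illustration of this conv-BN fusion idea, the following minimal PyTorch sketch folds a BatchNorm2d layer into the preceding convolution; the function name and layer handling are ours for illustration and are not taken from the YOLOv7 code base:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution for inference.
    Illustrative sketch only; not the YOLOv7 implementation."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      kernel_size=conv.kernel_size, stride=conv.stride,
                      padding=conv.padding, bias=True)
    # BN computes y = gamma * (Wx + b - mean) / sqrt(var + eps) + beta,
    # so the fused kernel is scaled by gamma / sqrt(var + eps).
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused
```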
In order to select a YOLOv7 network suitable for real-time onboard detection, we compared YOLOv7-tiny, YOLOv7, and YOLOv7-X in this study. The essential difference between them is that the depth and width of the three networks increase incrementally; the numbers of parameters of the YOLOv7-tiny, YOLOv7, and YOLOv7-X models are 6.03 million, 37.2 million, and 70.9 million, respectively [25]. YOLOv7 has been tested on the MS COCO (Microsoft Common Objects in Context) dataset [47], one of the most popular large-scale public datasets for object detection, segmentation, key-point detection, and captioning. The results show that, in comparison with YOLOv5, YOLOv7 has fewer parameters, a smaller amount of computation, and a faster inference time, while the mean Average Precision (mAP) of YOLOv7 is also slightly higher [25]. The mAP is a generalized metric for measuring the performance of detection models. The AP values of all categories are averaged to obtain the mAP for a specific method, where the AP of a category is the area under the P(R) curve:

$$AP = \int_0^1 P(R)\,dR \tag{1}$$

Precision (P) and Recall (R) are defined as:

$$P = \frac{TP}{TP + FP} \tag{2}$$

$$R = \frac{TP}{TP + FN} \tag{3}$$

where TP (True Positive), FP (False Positive), and FN (False Negative) correspond to objects correctly detected, backgrounds mistakenly detected as objects, and objects not detected, respectively.
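For reference, a minimal sketch of how these metrics can be computed is given below; the function names are illustrative and not taken from an established evaluation toolkit:

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Equations (2) and (3): P = TP / (TP + FP), R = TP / (TP + FN)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def average_precision(precisions, recalls) -> float:
    """Equation (1): AP as the area under the P(R) curve (trapezoidal rule)."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order],
                          np.asarray(recalls)[order]))

# mAP = mean of the per-class AP values, e.g.:
# m_ap = float(np.mean([average_precision(p_c, r_c) for p_c, r_c in curves]))
```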

C. TRACKING
Simple Online and Realtime Tracking (SORT) [35] is a tracking-by-detection algorithm that solves the multiple object tracking problem. It is known for its efficient association of objects in online, real-time applications. Key components of SORT include detection, object state propagation, ID association, and lifespan management of tracked objects [26]. It uses a Kalman filter as its motion model to estimate and propagate target identities between frames. The Hungarian algorithm is used for ID assignment according to the intersection-over-union (IoU) distance between each new detection and the predicted bounding boxes of the existing targets. Its simplicity and efficiency make it suitable for onboard UAV target tracking scenarios.
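The core association step of a SORT-style tracker can be sketched as follows, matching new detections to Kalman-predicted boxes by maximizing total IoU with the Hungarian algorithm (here via SciPy). This is a simplified illustration under our own naming, not the actual SORT source:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(detections, predictions, iou_threshold=0.3):
    """Match detections to Kalman-predicted boxes by minimizing 1 - IoU."""
    cost = np.array([[1.0 - iou(d, p) for p in predictions] for d in detections])
    rows, cols = linear_sum_assignment(cost)
    # keep only matches whose overlap clears the IoU threshold
    return [(r, c) for r, c in zip(rows, cols)
            if iou(detections[r], predictions[c]) >= iou_threshold]
```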
DeepSORT [36] is used in this study to improve on the performance of SORT. Our previous study [38] found that SORT yields a relatively high number of identity switches when tracking through occlusions. This problem is mitigated by DeepSORT, which uses a CNN to provide a more informed association metric combining motion and appearance information. DeepSORT introduces the Mahalanobis distance in the motion model and utilizes the cosine distance of the appearance descriptor to improve performance [36]. Given its reduced failure rate under occlusion, we use DeepSORT for the online multi-UAV tracking scenarios.
In this study, we use a markerless identity correction algorithm to identify each follower UAV. Its operation depends on the tracking results from the initial frames of the video or image stream. We assume that the initial IDs of the follower UAVs are known to the leader UAV when they are captured in the initial frame. For instance, the follower UAVs hover in a line with their IDs known from the spatial arrangement, waiting for signals to operate until the leader UAV successfully tracks them. The bounding boxes are assigned the predetermined IDs according to their sorted box locations, and these replace the random IDs allocated by DeepSORT in the subsequent frames. This trick generates satisfactory results thanks to the robustness of YOLO and DeepSORT. In addition, it has an unexpected benefit: some FP targets are effectively eliminated, since their transient appearances are never assigned an ID in the first place.
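A minimal sketch of this identity correction trick is given below; the track dictionary layout and the function names are our illustrative assumptions, not the actual onboard code:

```python
def build_id_map(first_frame_tracks, predetermined_ids):
    """Map DeepSORT's arbitrary track IDs to predetermined UAV IDs, assuming
    the followers hover in a line whose left-to-right order is known."""
    # sort the initial tracks by the x coordinate of their box centers
    ordered = sorted(first_frame_tracks,
                     key=lambda t: (t["box"][0] + t["box"][2]) / 2)
    return {t["track_id"]: uav_id
            for t, uav_id in zip(ordered, predetermined_ids)}

def correct_ids(tracks, id_map):
    """Keep only tracks seen in the initial frame; transient FP tracks
    never receive an ID and are therefore dropped."""
    return [dict(t, uav_id=id_map[t["track_id"]])
            for t in tracks if t["track_id"] in id_map]
```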

D. LOCALIZATION
A localization model that accounts for the perspective relationship is constructed here in three-dimensional (3-D) space. As shown in Fig.4(a), three coordinate frames are defined: the global inertial frame $C_g = o_g x_g y_g z_g$, the body-fixed frame $C_b = o_b x_b y_b z_b$, and the camera frame $C_c = o_c x_c y_c z_c$, all of which follow the right-hand rule and whose z axes all point vertically downward. The rotation matrix of the camera frame with respect to the body frame is denoted as ${}^bR_c$. The coordinate origins $o_b$ and $o_c$ are set to coincide, so the translation vector ${}^bt_c$ is 0 in the transform matrix ${}^bT_c$. Any vector in the camera frame can be converted to a vector in the body frame as:

$${}^{b}a = {}^{b}R_{c}\,{}^{c}a \tag{4}$$

The rotation matrix and the translation vector of the body frame with respect to the global frame are denoted as ${}^gR_b$ and ${}^gt_b$. We can transform camera coordinates into global coordinates according to the transform matrix ${}^gT_c$, which is

$${}^{g}T_{c} = {}^{g}T_{b}\,{}^{b}T_{c} = \begin{bmatrix} {}^{g}R_{b}\,{}^{b}R_{c} & {}^{g}t_{b} \\ 0 & 1 \end{bmatrix} \tag{5}$$

It is evident that ${}^gt_c = {}^gt_b$ and ${}^gR_c = {}^gR_b\,{}^bR_c$. Any vector ${}^ca$ obtained from the camera frame can be transformed to the absolute location vector ${}^ga$ in the global frame using

$${}^{g}a = {}^{g}R_{c}\,{}^{c}a + {}^{g}t_{c} \tag{6}$$

Here ${}^gR_b$ can be fetched in real time from the onboard 3-axis gyroscope, and ${}^bR_c$ can be acquired through a calibration process before the flight, since it is a constant matrix representing how the camera is mounted.
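A short NumPy sketch of Equations (4) and (6) might look as follows, assuming the rotation matrices and translation vector are given as 3 × 3 and length-3 arrays; the function names are ours:

```python
import numpy as np

def camera_to_body(a_c, R_bc):
    """Equation (4): b_a = b_R_c @ c_a, since o_b and o_c coincide (b_t_c = 0)."""
    return R_bc @ a_c

def camera_to_global(a_c, R_gb, R_bc, t_gb):
    """Equation (6): g_a = g_R_c @ c_a + g_t_c, with g_R_c = g_R_b @ b_R_c
    and g_t_c = g_t_b as stated in the text."""
    return (R_gb @ R_bc) @ a_c + t_gb
```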
After the detection and tracking process, the follower UAVs are correctly recognized by the image processor. As shown in Fig.4, we denote the positions of the follower UAVs as $o_b$ in the body frame and the locations of the corresponding tracked targets as $o_I$ in the image coordinate system (ICS). Each target in Fig.4(b) provides bounding box information including its center coordinates $(x_I, y_I)$ in the ICS and its width $w$ in pixels. The optical center (OC) is denoted $OC_b$ in the plane of the follower UAVs and $OC_I$ in the image. Geometric similarity gives the relationship:

$$d_c = \frac{H}{f}\,d_I \tag{7}$$

Here $f$ is the focal length of the camera in pixels, and $H$ is the relative height between the leader UAV and the plane of the follower UAVs, which is assumed to be known since each UAV has its own altimeter and the ground station can command their heights. The distance $d_I$ is also a known term, since the target's position in the ICS is known, and thus $d_c$ can be easily calculated. With the known positions ${}^ca$ of the follower UAVs in the camera frame, their relative positions in the body frame of the leader UAV can be obtained according to Equation (4). Moreover, once the global position ${}^gt_b$ of the leader UAV is known, the follower UAVs' absolute positions ${}^ga$ can be obtained using Equation (6). All the follower UAVs' localization information is then formatted by the leader UAV and finally distributed to all the follower UAVs through broadcast. Whether relative or global positions are needed depends on the requirements of the formation control algorithm. The follower UAVs must remain inside the field of view of the leader UAV during the whole process, which can be ensured by a proper formation control strategy unless detrimental outer interference is encountered.
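Following Equation (7), a hypothetical helper that converts a tracked bounding-box center into a camera-frame position could be sketched as below; the function name and argument layout are our assumptions:

```python
import numpy as np

def pixel_to_camera(center_xy, optical_center_xy, f_px, H):
    """Scale the image-plane offset d_I (pixels) by H / f to obtain the
    metric offset d_c in the follower plane, per Equation (7);
    H is known from the altimeters."""
    d_I = np.asarray(center_xy, float) - np.asarray(optical_center_xy, float)
    x_c, y_c = (H / f_px) * d_I
    return np.array([x_c, y_c, H])  # c_a: target position in the camera frame
```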

E. FORMATION CONTROL
The formation control strategy in this study is a leader-follower approach. One UAV plays the role of the leader and the rest of the UAVs are designated as followers. The leader UAV hovers above or below the plane of the follower UAVs, whose altitudes are set the same.
We first define a directed graph G as a pair (V, E), where the vertex set V denotes the set of UAVs and the edge set E ⊆ V × V denotes the set of directed pairs of UAVs that share their positions. No self-edge is assumed, i.e., (i, i) ∉ E for any i ∈ V. The set of neighbors of any UAV i ∈ V is defined as N_i := {j ∈ V : (i, j) ∈ E}. In this study, the one-broadcaster-multiple-receiver communication mechanism feeds each follower UAV with all of the followers' positions. As a result, the corresponding graph G is strongly connected, since every UAV has access to the positions of the other UAVs in the swarm.
In this work, we use a single-integrator model for the formation controller to compute high-level velocity commands. Consider N UAVs over a fully connected graph G in a 2-D plane:

$$\dot{p}_i = u_i, \quad i = 1, \ldots, N \tag{8}$$

where $p_i \in \mathbb{R}^2$ denotes the position and $u_i \in \mathbb{R}^2$ the control input of UAV $i$ with respect to the leader UAV's body frame. The relative positions are available to UAV $i$, so that:

$$p_{ji} := p_j - p_i, \quad j \in N_i \tag{9}$$

Let $p^* \in \mathbb{R}^{N \times 2}$ denote the desired 2-D positions of the N UAVs with respect to the body center of the leader UAV ($o_b$/$o_c$); $p^*_{ji}$ is defined similarly. The objective of the formation control is to satisfy the following constraints:

$$p_{ji} = p^*_{ji}, \quad \forall (i, j) \in E \tag{10}$$

These constraints only specify the desired displacements, yet they inherently drive $p$ to $p^*$ when the leader UAV's $p^*_0$ is assigned as the origin of its own body frame. This formation control problem can be solved with a consensus protocol. Consider the following control law:

$$u_i = k_p \sum_{j \in N_i} \left( p_{ji} - p^*_{ji} \right) \tag{11}$$

where $k_p > 0$ is the formation control gain. Defining $e_p := p^* - p$, the corresponding consensus dynamics is:

$$\dot{e}_p = -k_p \left( L \otimes I_2 \right) e_p \tag{12}$$

where $L$ is the Laplacian matrix of G, $I_2$ is the identity matrix, and $\otimes$ denotes the Kronecker product. Similar consensus dynamics have been studied in [48] and [49]. The final velocity command also passes through a saturation mechanism:

$$u_i \leftarrow u_i \min\!\left(1, \frac{u_{max}}{\lVert u_i \rVert}\right) \tag{13}$$

where $u_{max}$ is the desired maximum speed cutoff.

In this study, both the simulation and the flight test are conducted with the leader UAV hovering stably above or below the plane of the follower UAVs. The leader UAV is considered quasi-static due to wind interference, and the case of a dynamic leader UAV will be considered in our future work. In addition, to maintain the safety of the UAVs, an artificial potential field (APF) based collision avoidance module was used in the flight test. It is seldom triggered, as the formations tested in this study have no track intersections, for safety reasons.
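A compact sketch of the control law (11) with the saturation (13) is given below, using the gain and speed values reported in Section IV. The array layout and function name are our illustrative assumptions; in this study the graph is fully connected, so the adjacency matrix is all-ones off the diagonal:

```python
import numpy as np

def formation_control(p, p_star, adjacency, k_p=0.2, u_max=5.0):
    """u_i = k_p * sum_{j in N_i} (p_ji - p*_ji), then saturate to u_max.
    p, p_star: (N, 2) current and desired positions in the leader's body frame;
    adjacency[i, j] is True when UAV i receives UAV j's position."""
    n = p.shape[0]
    u = np.zeros_like(p, dtype=float)
    for i in range(n):
        for j in range(n):
            if adjacency[i, j]:
                u[i] += k_p * ((p[j] - p[i]) - (p_star[j] - p_star[i]))
    # Equation (13): scale each command down to the u_max speed cutoff
    norms = np.linalg.norm(u, axis=1, keepdims=True)
    return u * np.minimum(1.0, u_max / np.maximum(norms, 1e-9))
```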

IV. IMPLEMENTATION
A. SIMULATION
The simulation environment for this study was RflySim [50], a fully functional simulation environment for testing multi-UAV models. It has proven convenient and efficient in accelerating the development of upper-level algorithms such as CV and Artificial Intelligence (AI). The popular 3D engine Unreal Engine 4 is used to ensure fidelity in visual scene rendering. As shown in Fig.5, two setups are presented in the simulation framework: the leader UAV hovered either above or below the follower UAVs, with an altitude difference of 8 m. Corresponding manipulation of the image coordinates was conducted when presenting the follower UAVs' paths. With an image resolution of 1440 × 810, the output dataset for each case had 200 images (800 targets of the same category). The detection-tracking-localization framework was then implemented off-board in simulation.
In this study, the simulation was conducted for three typical scenarios: sky, land, and lake. The same formation switch was conducted in these scenarios, starting with a line-shaped formation and ending with a diamond-shaped formation. The leader UAV was set static while its $p^*_0$ kept migrating along a circular trajectory, and the virtual leader UAV position was reported so that $e_{p_0} = 0$. The three YOLOv7 versions, YOLOv7-tiny, YOLOv7, and YOLOv7-X, were compared to determine how the detection performance in different scenarios depends on the choice of detector. Simulated results were also presented for the tracking, localization, and formation processes, which were used to verify the feasibility of the framework concept.

B. EXPERIMENT
The experimental system mainly included five quadrotor UAVs and a ground station (as shown in Fig.6). Five remote controllers accompanying the UAVs were prepared for manual protection in case of emergency. Six datalink modules were used: one at the ground end and five at the sky ends. They worked in a mesh network mode in which the leader UAV sent out position information, while the follower UAVs only received commands and positions. The ground station controlled the leader UAV, and the formation switch tasks were initiated once the leader UAV signaled its follower UAVs.
Although the swarm is heterogeneous, we used a homogeneous quadrotor platform for all five UAVs, except that the leader UAV was additionally mounted with a downward-facing camera. The self-assembled SFR450-frame quadrotor was employed as the test platform (as shown in Fig.7). The flight controllers of all the UAVs were the Nora Autopilot with PX4 installed. The MIMO201JZ module was used as the datalink to communicate with the ground station. A mesh network was maintained, but only the ground control station and the leader UAV could send information; since the follower UAVs could only receive commands and information, the effective network topology between the leader and follower UAVs is a star network. GPS was used solely as a reference to check the accuracy of the vision-based location measurement: after the flight test, the visually measured localization results were downloaded and compared with the GPS locations from the flight log. An NVIDIA Jetson Nano was used as the embedded computer on all of the UAVs. Its graphics processing unit ran the visual detection-tracking-localization framework on the leader UAV, where the YOLOv7-tiny, DeepSORT, and localization modules ran as software packages; in contrast, only the formation control module ran on the follower UAVs. All the UAVs implemented an offboard control module in order to operate the autopilot. Each UAV ran the Robot Operating System (ROS) on Ubuntu 18.04, with sensor information and control commands exchanged via ROS topics.
In the real flight test, three different formation-switching tasks were conducted in a single flight. Results are presented only for the formation-switching processes; the take-off and landing processes are not considered. Between two tasks, the leader UAV might slightly adjust its position and attitude; otherwise, the leader UAV was supposed to hover statically, although it was subject to wind interference according to the flight log.
During the experiments, we set the maximum speed to $u_{max}$ = 5 m/s and the formation control gain to $k_p$ = 0.2. The monocular camera recorded video with a resolution of 720 × 1280 at a frame rate of 30 fps. The position broadcast rate was 2 Hz and the output frequency of the formation controller was 20 Hz. The datalink in this study has a packet loss rate of less than 2% and a single-hop delay smaller than 3 ms, which ensures reliable communication quality. The collision avoidance term is triggered when the inter-UAV distance is less than 8 m. These parameters were chosen such that the follower UAVs converge to equilibrium within approximately 10 s after a new formation command is issued.

C. DATASET AND TRAINING SETUP
The simulation video was processed by frame extraction to obtain a dataset totaling 300 images over the three scenarios (100 images per scene). The dataset was divided into training, validation, and test sets in a ratio of 8:1:1. The graphical image annotation tool LabelImg was used to label the image dataset. XML files were then generated containing information such as the path to the image, the location of the label box, and the box category.
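The 8:1:1 split can be reproduced with a short script along the following lines; the directory names are illustrative, not those of our actual pipeline:

```python
import random
import shutil
from pathlib import Path

def split_dataset(image_dir, out_dir, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle images and copy them into train/val/test folders (8:1:1)."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n = len(images)
    bounds = [0, int(ratios[0] * n), int((ratios[0] + ratios[1]) * n), n]
    for name, lo, hi in zip(("train", "val", "test"), bounds, bounds[1:]):
        dest = Path(out_dir) / name
        dest.mkdir(parents=True, exist_ok=True)
        for img in images[lo:hi]:
            shutil.copy(img, dest / img.name)
```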
The object tracking algorithm DeepSORT uses stochastic gradient descent (SGD) to optimize network parameters during training, with momentum = 0.9, weight decay coefficient weight_decay = 5 × 10⁻⁴, and learning rate learn_rate = 0.1. The weights were saved every 10 epochs, the total number of training epochs was 40, and batch_size = 64. The kernel size is 3 × 3 and the number of kernels is about 2000.
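In PyTorch, the training configuration described above corresponds to an optimizer set up roughly as follows; the model here is only a placeholder for the appearance network, and the checkpoint naming is illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 751)  # placeholder for the DeepSORT appearance CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

for epoch in range(40):           # 40 training epochs, batch_size = 64
    ...                           # forward/backward pass over the dataset
    if (epoch + 1) % 10 == 0:     # weights saved every 10 epochs
        torch.save(model.state_dict(), f"checkpoint_{epoch + 1}.pt")
```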
The object detection experiments were conducted under Windows 10, with deep learning packages installed in a Python 3.7 environment, including toolkits such as Torch, OpenCV-Python, and NumPy. An NVIDIA RTX 3060 Ti with 8 GB of memory was used for offline computation. CUDA and cuDNN were used for parallel computation during training and inference. The batch size was 8, the initial learning rate 0.001, the input image size 416 × 416, and the number of training epochs 400.
For the real flight test, we collected a dataset of 3900 UAV formation images in total. The same hardware and software environment as for the simulation dataset was used for model training. The original dataset was divided into training, validation, and test sets in a ratio of 8:1:1. Model testing was completed onboard using an NVIDIA Jetson Nano AI platform running Python 3.6.9, with deep learning packages such as Torch 1.7.0, Tensorboard 2.8.0, NumPy 1.19.5, and OpenCV 4.3.0 installed.

V. RESULTS
A. SIMULATION
1) DETECTION
Fig.8 presents the detection results of the UAVs using YOLOv7-X. All the UAVs were detected with relatively high confidence scores (above 0.91). This performance is maintained across the whole dataset, and, remarkably, all the UAVs were detected with few FN cases even in the results from YOLOv7-tiny and YOLOv7. The influence of the scenarios is also limited, in that the YOLOv7 detector recognized UAVs against various complicated backgrounds with high precision regardless of daylight, vegetation, and landform. Table 1 presents the AP values for a more quantitative comparison of YOLOv7-tiny, YOLOv7, and YOLOv7-X across the different scenes. YOLOv7-X performs best among the three detectors; however, even though YOLOv7-tiny ranks last in AP, its AP value is still above 94%. Across scenarios, UAVs against the sky background tend to be detected more easily, while UAVs in the land scenario are more difficult to recognize due to the complicated land textures. Nevertheless, the overall AP difference across the three scenarios is only 2%, which indicates that the influence of the scenario is limited in simulation.

2) TRACKING
Tracking results for the evolution of the formation switch are presented in Fig.9, on the land background with complicated trees and shadows. Images from six instants with tracking IDs and bounding boxes are shown. The tracking IDs are assigned in sequence from left to right in the initial line-shaped formation. After that, the four follower UAVs gradually move to their allocated positions in the diamond-shaped formation with their IDs correctly identified. No occlusion or repeated identification occurred during the whole process, showing the robustness of the YOLOv7-tiny + DeepSORT tracking framework.
Similar tracking performance is maintained for all the cases; the evolution of the tracking IDs for the lake case is presented in Fig.10 as an example. All four UAVs were identified in every frame. The results also show that no other IDs appeared, meaning that no FP targets from the background were identified as UAVs. The visual tracking results in simulation laid a good foundation for deriving the locations in the localization step.

3) LOCALIZATION
In this study, the paths of the follower UAVs are first calculated and then fed into the simulation environment, which means the positions of all the UAVs are known in advance. Since the location and attitude of the leader UAV are also predetermined, the expected UAV locations in the ICS are known as well. As a result, we compare the visually tracked UAV paths in image coordinates to evaluate the localization performance.
Localization results are presented in Fig.11 as UAV paths in ICS coordinates. It should be noted that camera alignments with different observing directions give different path orientations, which need to be adjusted. For instance, the paths in Fig.11 where the camera looks upward need a y-axis reversal about the center, after which the paths from the sky case are consistent with those of the land and lake cases.
The results in Fig.11 show satisfying continuity in locating the different UAVs in ICS coordinates, which is beneficial for accurately controlling the formation in the real flight test. They also indicate that no occlusion occurred during the tracking process, since otherwise the color or the marker within each path would not be uniform.
In order to evaluate the localization accuracy for the different YOLOv7 detectors and background features, an error analysis of the path tracking results is presented in Fig.12. The general error is within 3 pixels, which corresponds to about 0.1 m in the simulated environment. YOLOv7-X has the smallest error, but YOLOv7-tiny's performance is comparable to that of the other two versions.

4) FORMATION
The formation switch from the line shape to the diamond shape is combined with a circular motion at constant speed. It is realized by the static leader UAV reporting its virtual $p_0$ and the circularly migrating $p^*_0$, which are made the same, to the follower UAVs in the broadcast information. Fig.13 shows that no course intersection exists during the formation-changing process. The diamond-shaped formation is achieved within 10 s; after that, the four UAVs coincide with their target positions, making constant-speed circular motion about the virtual leader UAV0. The results in Fig.13 indicate that the current formation control strategy is effective in realizing simple formation-switching tasks in setups similar to the simulation in this study.

B. EXPERIMENT
1) DETECTION AND TRACKING
The evolution of the formation switch in Task 1 is presented in Fig.14. As in the simulation results, all of the follower UAVs were correctly detected and assigned the desired IDs. Task 1 lasts about 30 seconds and ends with a diamond-shaped formation. Since the relative altitude between the leader UAV and the follower UAVs is larger than in the simulation setup, the UAVs appear much smaller in the images, which leads to FP targets caused by ground patterns, although none appear in the selected frames. Fig.15 presents all the IDs of the tracked UAV targets over the whole flight test. Besides the consistent identification of the four follower UAVs, several FP targets were also allocated IDs. However, these FP IDs appeared only occasionally and with different ID numbers. Since the real UAVs were already allocated the correct ID numbers, the new IDs could easily be recognized as FP outliers, which effectively diminished the background's interference in tracking the real UAVs.

2) LOCALIZATION
The localization results are presented in the camera frame so that the formation shape appears in an orientation consistent with the images. According to the Jetson Nano's performance report, the leader UAV's onboard localization update rate reached above 9 fps and the inference time (delay) was about 46 ms, as shown in Fig.16. The UAV locations converted from the image locations are compared with the GPS-measured locations after converting the latter from longitude-latitude-altitude coordinates to camera frame coordinates. In Fig.17, the localization errors are compared for the three tasks individually using box plots. In general, the localization error is about 0.3 m, with a few outliers reaching 0.5 m; judging by the mean values, the accuracy of the image-inferred localization is within 0.1 m. No correlation was found between the error and the UAV's ID. Using the leader UAV to provide the positions in a centralized way avoided the consistency adjustment of each UAV's GPS data, especially in the absence of an RTK ground base station.

3) FORMATION
The resultant paths converted from the GPS coordinates of the follower UAVs are presented in Fig.18. Since the leader UAV reports positions at 2 Hz due to the limitation of the datalink, the path positions are also provided at 2 Hz. The three formation-switching tasks were successfully executed by each UAV through the decentralized formation controllers. The paths deviate due to wind interference, but the desired formation switches were all accomplished.
A comparison between the visually measured and GPS-measured speeds of the follower UAVs is finally presented in Fig.19. The GPS locations were converted to camera frame coordinates and are presented at a rate of 10 fps, while the vision-based measurement is presented at 2 Hz, consistent with the broadcast rate. The results show satisfying consistency between the speeds measured in the two ways. Both measurements are subject to wind disturbance, while the vision-based measurement is also influenced by the detection accuracy, which may magnify the error when calculating speed. In general, the vision-based measurement successfully controlled the follower UAVs to accomplish the desired formation switches.

VI. DISCUSSION
The performance of the proposed framework is compared in Table 2 with previous studies [31], [32], [33], [34], [51], [52], which also use vision-based methods to track or locate markerless UAVs in swarms. All of these studies happen to have no more than three UAVs in total in the swarm. The major reason for the small number of tested UAVs may be that such an arrangement avoids occlusion of UAV IDs, since at most two other UAVs appear in one UAV's field of view (FOV): a markerless swarm lacks a mechanism that effectively identifies different UAVs even when the number of UAVs is small. Moreover, the occlusion problem could be disastrous for vision-based formation control, since each UAV needs to reach a consensus to ensure its own position in the formation, and any failure in identity recognition could lead to a collision. This problem has been addressed by Schilling et al. [53], who indicate that when the number of agents in a swarm increases to hundreds or thousands, the obstructed members need to be discarded and a new interaction scheme must be introduced to maintain the inter-agent distances and velocity alignment [53]. Although the proposed method is designed for formation in a 2-D plane, where no obstruction-induced occlusion should occur, we believe it can still bridge the gap in swarm scalability from three UAVs to hundreds by utilizing the powerful DL-based MOT/MTT methods.
In this study, both simulation and real flight tests were conducted. The simulation part not only validated the formation control algorithm but also simulated the multi-UAV tracking problem under different background conditions, which greatly contributed to putting the vision system into practice. Montijano et al. [51] demonstrated the effectiveness of a distributed formation controller in simulation before conducting hardware experiments; however, only the formation control algorithm was validated in their simulation, and the vision system was not involved.
Smith et al. [54] provided only simulation results comparing several tracking filter solutions, and the specifics of vision-based multi-target tracking problems under different conditions were not addressed. Ge et al. [31] presented several vision-based decentralized Bayesian filtering strategies for the detection and tracking of multiple aerial vehicles and the corresponding perception consensus; however, the background's effect on the filters was not discussed. Kabore and Güler [52] demonstrated the stability and convergence of their formation control law and verified the entire framework in numerous simulations, where an Iris quadcopter was placed in the Gazebo simulation environment with different perspectives and backgrounds. The simulation in this study utilized the RflySim and UE4 environment, which is considerably more convincing for tracking multiple UAVs from different perspectives and against different backgrounds.
Most previous studies of vision-based formation control were tested in indoor scenarios. One reason is that the indoor environment is a common GNSS-denied scenario, and vision-based formation control is a natural solution when no external positioning infrastructure is available. Another reason for the lack of outdoor tests is that UAVs appear too small to be visually detected when the span of the swarm is large. This fact can fundamentally change the choice of swarm architecture for outdoor applications. Visual sensors such as cameras are no longer suitable to be mounted on every UAV as in the indoor cases of [31] and [52]: mutual observation is easily obstructed by obstacles, especially at low altitudes, and equipping each UAV with a high-resolution camera and a powerful image processor to detect remote neighbors is impractical considering the cost of the swarm. The study in [51] is an exception, since its UAVs depend on their own cameras to obtain absolute locations and do not track each other visually to maintain the formation. However, it is also unsuitable for outdoor scenarios, since each swarm member's own homography computation would consume tremendous computational resources [55], and relative displacements maintained from absolute localization would induce much higher errors [16]. If radio communication is not totally prohibited, the hierarchical swarm structure is recommended: a leader UAV at a different altitude informs the follower UAVs of their relative locations while the formation is controlled in a decentralized way. It is proposed not only because it avoids the aforementioned disadvantages, but also because it can generate more accurate formations when the leader UAV is equipped with more sophisticated sensors [32], [33].
We finally address the edge computation capability for the vision-based formation control problem of a UAV swarm. Using DL-based visual detectors and trackers as unified CV solutions has become an unstoppable trend for tracking UAVs, judging by the evolution of the methods listed in Table 2. Onboard AI computers capable of DL computation with lighter weight, lower cost, and more stable performance will keep their popularity in the community. The selection of more advanced hardware and more efficient detector/tracker algorithms will further improve the performance of the vision-based formation control strategy. Future work could make improvements in enlarging the swarm scalability and diminishing the localization errors. Vision-based swarm navigation with complicated tasks is also a potential research direction under the current framework, which we hope to address in our following studies.

VII. CONCLUSION
This study proposes a vision-based formation control framework for a UAV swarm. A hierarchical architecture is constructed for the swarm in which a leader UAV provides the visual localization information for its follower UAVs. State-of-the-art deep learning tools, YOLOv7 and DeepSORT, are used for UAV detection and tracking, respectively. The location information of the UAVs is then derived and distributed, based on which the formation control algorithm is executed. In addition, with the booming development of AI computers for embedded applications, these resource-intensive algorithms have been implemented directly onboard. We validated this vision-based formation control framework in simulation and in a real flight experiment. It turns out that this vision-based framework can achieve real-time formation control in the absence of GNSS. The localization update rate reached 9 Hz and the inference time was about 50 ms during the flight test with the onboard computer. The error of the vision-based relative localization during formation switching was within 0.3 m. Next, we will further study performance optimization and explore the capability of the proposed vision-based framework with more UAVs.