Distilling Tiny and Ultra-fast Deep Neural Networks for Autonomous Navigation on Nano-UAVs

Nano-sized unmanned aerial vehicles (UAVs) are ideal candidates for flying Internet-of-Things smart sensors to collect information in narrow spaces. This requires ultra-fast navigation under very tight memory/computation constraints. The PULP-Dronet convolutional neural network (CNN) enables autonomous navigation running aboard a nano-UAV at 19 frame/s, at the cost of a large memory footprint of 320 kB, and with drone control in complex scenarios hindered by the disjoint training of collision avoidance and steering capabilities. In this work, we distill a novel family of CNNs with better capabilities than PULP-Dronet, but memory footprint reduced by up to 168x (down to 2.9 kB), achieving an inference rate of up to 139 frame/s; we collect a new open-source unified collision/steering dataset of 66 k images for more robust navigation; and we perform a thorough in-field analysis of both PULP-Dronet and our tiny CNNs running on a commercially available nano-UAV. Our tiniest CNN, called Tiny-PULP-Dronet v3, navigates a challenging and never-seen-before path, composed of a narrow obstacle-populated corridor and a 180° turn, with a 100% success rate at a maximum target speed of 0.5 m/s. In the same scenario, the SoA PULP-Dronet consistently fails despite having 168x more parameters.


I. INTRODUCTION
WITH the growth of the Internet of Things (IoT), autonomous nano-sized unmanned aerial vehicles (UAVs) empowered with onboard artificial intelligence (AI) are becoming increasingly important [6]. With a diameter of less than 10 cm and weighing only tens of grams, these nano-UAVs can serve as ubiquitous IoT nodes, autonomously navigating environments while sensing and analyzing their surroundings [7]. Their compact form factor allows them to operate in narrow/cluttered spaces [5], [8] and safely in the proximity of humans [9] (as shown in Figure 1). These characteristics make nano-UAVs suitable for many use cases, such as exploring hazardous indoor environments [10] and rescue missions [11], [12].
To act as autonomous smart IoT sensors, nano-UAVs must execute concurrent real-time tasks entirely onboard, including multiple heavy AI perception workloads [4], [5], [9], [13], [14]. However, achieving such a high level of autonomy presents significant challenges. Their extremely limited payload restricts them to ultra-low-power microcontroller units (MCUs) with stringent computational and memory constraints, e.g., a few hundred kB of on-chip memory and a few giga operations per second (GOps/s). These constraints have prevented the deployment of multiple AI tasks on nano-UAVs. Therefore, overcoming the computational/memory burden imposed by MCUs becomes paramount. We focus on optimizing and minimizing the AI workloads without compromising the drone's behavior when stressed in real-world testing scenarios.
We specifically address a critical AI task required by any IoT node moving within an environment: visual navigation, i.e., the capability to autonomously navigate an environment by relying solely on local visual information while avoiding collisions with obstacles. The State-of-the-Art (SoA) convolutional neural network (CNN) for visual-based autonomous navigation on nano-UAVs is PULP-Dronet v2 [5], outputting a collision probability and a steering angle following visual cues in an environment, such as lanes. This CNN has been deployed on a nano-sized UAV and demonstrated in both indoor and outdoor environments, enabling turns and avoiding collisions with dynamic obstacles. However, this CNN has some shortcomings. First, its navigation capabilities come at a non-negligible computational cost, allowing for a maximum throughput of 19 frame/s when running on the SoA Greenwaves Technologies (GWT) GAP8 System-on-Chip (SoC) [5].
arXiv:2407.12675v1 [eess.IV] 17 Jul 2024
Moreover, we show that this CNN performs poorly when navigating static obstacles, limiting its real-world applicability. This limitation stems from the training dataset featuring disjoint image sets for steering and collision labels [5]. A solution to this issue is to collect new data that integrates joint information on obstacle presence and yaw rate into all the training images.
In this paper, we set a new milestone in the SoA of autonomous visual-based nano-UAV navigation by shrinking the CNN memory footprint and computational burden while simultaneously improving its ability in static obstacle avoidance. Our key contributions are:
• we develop a methodology for dataset collection on a nano-UAV. We collect unified collision avoidance and steering information only with nano-UAV onboard resources, without dependence on external infrastructure like motion capture systems. The resulting PULP-Dronet v3 dataset consists of 66 k labeled images, which we release open-source along with our dataset collection framework;
• we train and deploy a novel family of CNNs, i.e., PULP-Dronet v3, that enables visual-based static obstacle avoidance and autonomous navigation on nano-UAVs;
• we perform an extensive study for minimizing the memory footprint and computational complexity of CNNs for autonomous nano-UAV navigation. The resulting CNNs offer different trade-offs between in-field performance and model size. We validate all our CNNs on our collected dataset and characterize their end-to-end execution time on the GAP8 SoC.
Our tiniest CNN, i.e., Tiny-PULP-Dronet v3, has only 2.9 k parameters and 1.1 M multiply-accumulate (MAC) operations, 168× smaller and 7.3× faster (up to 139 frame/s) than the baseline CNN, i.e., the SoA PULP-Dronet v2 [5], when running on the same PULP GAP8 SoC. This outcome allows us to free up precious computational resources that can be exploited to tackle additional tasks onboard. This CNN only requires 0.7 mJ for each inference, with an average power consumption of 100 mW when running on the GAP8's cores at their maximum frequencies.
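The throughput and energy figures quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes that the energy per inference is the average power multiplied by the per-frame latency at peak throughput, which is a simplification and not a measurement procedure taken from this work.

```python
# Figures quoted in the text above.
baseline_fps = 19.0   # PULP-Dronet v2 throughput on GAP8 [frame/s]
tiny_fps = 139.0      # Tiny-PULP-Dronet v3 peak throughput [frame/s]
avg_power_w = 0.100   # average power at maximum core frequencies [W]

speedup = tiny_fps / baseline_fps          # throughput ratio
latency_s = 1.0 / tiny_fps                 # time per frame at peak throughput
energy_mj = avg_power_w * latency_s * 1e3  # assumed: energy = power x latency

print(f"speed-up: {speedup:.1f}x")
print(f"latency:  {latency_s * 1e3:.1f} ms")
print(f"energy:   {energy_mj:.2f} mJ")
```

Under this assumption, the three reported numbers (7.3×, 139 frame/s, and 0.7 mJ at 100 mW) are mutually consistent.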
Furthermore, we deployed and extensively tested in-field two CNNs: a larger one (PULP-Dronet v3), matching the size of the baseline CNN, and our tiniest distilled CNN (Tiny-PULP-Dronet v3). These tests evaluate how reducing the computational CNN workload impacts the navigation performance. In a challenging scenario with a narrow corridor containing four static obstacles and a 180° turn, both CNNs outperformed the baseline one, as shown in the supplementary video. At a target speed of 0.5 m/s, our large and tiny CNNs successfully navigate through the corridor with an 80% and a 100% success rate, respectively, while the baseline CNN consistently fails. At higher speeds, the larger CNN demonstrated better robustness, succeeding 60% of the time at a target speed of 1 m/s, whereas the tiny CNN failed. We foster the research on autonomous nano-UAV navigation by releasing our new dataset, dataset collection framework, CNN weights, and code as open-source.
The remainder of this work is organized as follows. Section II provides an overview of tinyML for autonomous nano-UAVs. Section III covers the background of our work. Section IV details our dataset and collection methodology. Section V describes our CNN design/shrinking approach. Section VI compares the CNNs we introduced with experimental results. Section VII describes the in-field experiments with our nano-drone. Finally, Section VIII concludes the paper.

II. RELATED WORK

A. TinyML on autonomous nano-sized UAVs
This section presents an overview of lightweight and compact deep learning (DL) algorithms deployed on autonomous nano-UAVs, specifically nano-drones, which we summarize in Table I. To cope with the limited resources of nano-UAVs, multiple works in the literature aim to achieve only minimal functionalities and/or rely on low-dimensional input signals (e.g., no images) to minimize execution workload [1], [2], [15].
In the work of Kooi and Babuška [1], a deep reinforcement learning method employing proximal policy optimization enables autonomous landings of nano-drones on inclined surfaces. The CNN designed for this purpose requires around 4.5 kMAC operations per forward step. Although the CNN operates efficiently, computing an inference in approximately 2.5 ms on a single-core Cortex-M4 processor, its functionality remains constrained solely to the landing task. Similarly utilizing the Cortex-M4 processor, Neural-Swarm2 [2] leverages a controller based on DL for managing interaction forces during nano-drones' formation flights. With a processing cost of ∼9.5 kMAC, each nano-drone in the swarm processes the relative position and velocity data of surrounding UAVs, tackling safe maneuvers in close proximity with other UAVs.
To overcome the computational constraints posed by single-core MCUs, multi-core flight controllers designed for AI workloads target higher-complexity DL-based workloads. The SoA MCU for commercial off-the-shelf (COTS) nano-UAVs is the GAP8 SoC, an embodiment of the parallel ultra-low-power (PULP) paradigm with a general-purpose 8-core parallel cluster. This SoC offers a peak throughput of 5.4 GOps/s [16] and 512 kB of on-chip memory. This fully programmable MCU was successfully exploited to enable the execution of more sophisticated visual-based AI workloads [4], [5], [9].
Cereda et al. [9] exploited GAP8 to demonstrate a fully autonomous nano-drone performing a human pose estimation task. Their CNN, called PULP-Frontnet, predicts the drone's relative pose with respect to a freely moving human subject. This prediction aims at maintaining a consistent distance in front of human subjects while following their movements, and it only exploits low-resolution grayscale images captured from a front-facing camera. Their model requires 14.7 MMAC and achieves an inference rate of 48 frame/s while consuming 96 mW. The work in [3] exploited neural architecture search techniques to design a tinier version of the PULP-Frontnet CNN, obtaining a 7.4 MMAC and 65 kB model, while retaining good regression performance when tested in-field. Like our work, their approach to CNN architecture design explores structures inspired by Mobilenet v2 [17].
Exploiting the same GAP8 SoC, Lamberti et al. [4] tackled a more complex high-level task for nano-UAVs: object detection, which has a computational complexity significantly greater (over an order of magnitude higher) than [3], [9]. They deployed a 4.7 MB CNN capable of detecting two classes of objects, demonstrating its application in the context of an exploration and search mission. Such a workload requires a significant effort of 534 MMAC per inference and exceeds the SoC's on-chip L2 memory of 512 kB, introducing additional latency for transferring data between on-chip and off-chip memories. As a result, the system only achieved a throughput of 1.6 frame/s when tested onboard.
Zhang et al. [13] deployed a compute-intensive CNN with 310 k parameters for relative visual-based depth estimation on a nano-drone. The depth information from the front-facing grayscale camera is exploited for obstacle avoidance. However, the CNN runs onboard only at 0.2 frame/s and 1.2 frame/s on QVGA and QQVGA images, respectively, limiting the navigation capabilities of the drone to low speeds. Niculescu et al. [5] used instead a CNN for enabling end-to-end visual-based autonomous navigation on a nano-drone embedding the GAP8 SoC. The CNN, namely PULP-Dronet v2, has a computational cost of ∼41 MMAC per inference and requires 320 kB of memory footprint. When tested on an autonomous nano-drone, this neural network allowed it to tackle turns and avoid dynamic obstacles appearing along the way.
However, PULP-Dronet v2 has two main limitations. First, as highlighted by the authors [5], this CNN lacks the ability to guide the nano-UAV around static obstacles. This poses a significant limitation in its practical application within challenging real-world scenarios. This missing capability is attributed to the training dataset of PULP-Dronet v2, as it comprises multiple sets of images with disjoint labels for the steering and collision tasks [5]. The second limitation of this work is its non-negligible workload of ∼41 MMAC, limiting its throughput to a maximum of 19 frame/s. Despite being enough to enable autonomous navigation, this system would greatly benefit from solving the same task with only a minimal fraction of the current workload, as it would free precious resources that can be exploited to tackle additional tasks [11], [12], [18]. The work in [18] demonstrates that the PULP-Dronet v2 CNN architecture is over-parametrized for the task it solves. However, it lacks an in-field demonstration that a smaller neural network can achieve sufficient accuracy for safe flight.
Therefore, in our work we move from PULP-Dronet v2, focusing on: i) enabling new navigation skills, i.e., tackling the static obstacle avoidance scenario, and demonstrating with in-field results that the dataset was the limiting factor for achieving this task; and ii) enabling autonomous navigation with a CNN that is 168× smaller than PULP-Dronet v2, leaving plenty of memory and computing resources that can be exploited to embed additional AI tasks on the nano-drone.

B. Datasets for nano-UAV autonomous navigation
As we highlighted the importance of introducing a new dataset for visual-based autonomous nano-drone navigation in Section II-A, we review the datasets for UAVs in the literature. Dupeyroux et al. [19] released a dataset for obstacle detection and avoidance, specifically targeting micro-sized drones (i.e., weighing ∼0.5 kg). It has 92 GB of data, including full-HD images for ∼80% of the acquisitions, and the drone pose is tracked with a motion capture system. Thus, adapting these images to the nano-drone use case would require a photometric augmentation pipeline to convert full-HD images to the format of a low-quality and low-resolution camera commonly found on nano-UAVs [5]. Conversely, we collect a dataset exploiting nano-drones exclusively: we log the low-quality images from the onboard camera along with the pilot's yaw-rate input, which is fed to the flight controller during the data acquisition. This approach allows us to train an end-to-end CNN for autonomous navigation by exploiting the pilot's input commands rather than the UAV's external tracking.
The Dronet dataset, introduced by Loquercio et al. [20], was created to enable autonomous navigation on large UAVs. They merged two distinct image sets: one with 39.1 k high-resolution images from car driving scenarios labeled only with steering information, and another with 32.2 k high-resolution images from bicycle rides labeled with binary collision information. This dataset was expanded with ∼1300 grayscale low-quality images collected from a nano-UAV and labeled only with binary collision information [5].
The combination of these three disjointed sets of images forms the Himax+Original dataset used to train PULP-Dronet, which we exploit for comparison in Section VI.
However, this dataset does not provide joint steering and collision labels for the navigation, leading to poor performance when tackling static obstacle avoidance [5], as detailed in Section II-A. Similarly, the authors of [21] collected ∼15 k images for solving the same task as Dronet [20]. However, this dataset is also divided into two disjoint sets of images for steering and collision labels. The work in [22] used the Dronet dataset to train an off-board CNN for nano-drone autonomous navigation, still not tackling the static obstacle avoidance scenario.
The dataset presented in this paper overcomes these limitations as all the ∼66 k collected images feature both collision and steering information, i.e., it does not have disjoint labels. Specifically, we log the input of a human pilot navigating a nano-drone in multiple environments. These labels can be used to train an end-to-end CNN that imitates the behavior of a human pilot. Furthermore, we log the drone's estimated state and label all the images with a binary label for obstacle avoidance. We open-source our dataset, dataset collector framework, and pre-trained models to foster research on autonomous nano-drone navigation.

III. BACKGROUND

A. Robotic platform
The robotic platform employed in this work encompasses the Bitcraze Crazyflie 2.1 quadrotor, a commercially available nano-drone weighing 27 g and measuring 10 cm in diameter. This open-source and open-hardware drone uses the STM32F405 MCU as its flight controller coupled with the Bosch BMI088 inertial measurement unit (IMU), which combines an accelerometer and gyroscope. The STM32 MCU operates at speeds up to 168 MHz and integrates 48 kB static RAM (SRAM) and 128 kB flash memories, facilitating onboard inertial state estimation and actuation control. The IMU data drives an extended Kalman filter for state estimation at a rate of 100 Hz, while a proportional-integral-derivative control loop cascade manages actuation. This cascade comprises two control loops, with one governing attitude at 500 Hz and the second updating position at 100 Hz. The Crazyflie 2.1 also integrates an nRF51822 MCU for 2.4 GHz radio band communication, which we use only for the purpose of dataset collection (Section IV-B).
Our configuration extends the robotics platform with two pluggable printed circuit boards (PCBs) from Bitcraze: the Flow deck and AI-deck. The 3.5 g Flow deck incorporates the PMW3901 optical flow visual sensor and the VL53L1x time-of-flight (ToF) ranging module. The optical flow sensor detects drone motion in multiple directions, while the ToF sensor measures distance from the ground. These inputs enhance onboard state estimation accuracy, minimizing long-term drift.
The second expansion board, the AI-deck, serves as the primary onboard computing unit responsible for executing heavy AI workloads, such as the proposed autonomous navigation CNNs. This PCB, weighing 4.4 g, integrates the GWT GAP8 SoC as the onboard computer for visual processing and AI-driven autonomous navigation. GAP8 is accompanied by onboard memories: 64 MB of HyperFlash and 8 MB of HyperRAM. The PCB also mounts a Himax HM01B0 ultra-low-power monochrome QVGA camera, an ESP32-based Wi-Fi transceiver, and a UART communication channel between the STM32 and the GAP8. In this work, we focus on a configuration where the power-intensive Wi-Fi remains off, aiming for a fully autonomous system where all navigation intelligence resides onboard the nano-drone without relying on external communication or computation. However, we exploited the high bandwidth of the Wi-Fi communication for the dataset collection, as detailed in Section IV-B.
For the sake of dataset collection only (Section IV-B), we add an extra PCB to our drone, i.e., the Multi-ranger deck. This deck weighs 2.3 g and comprises five single-beam VL53L1x ToF distance sensors placed on the drone's top and lateral sides. These sensors provide line-of-sight distance measurements with a frequency of 20 Hz and have an operative range that goes from 0 m to 4 m.

B. GAP8 System-on-Chip
The GWT GAP8 SoC, which is used in this work for all onboard vision and tinyML processing, is an ultra-low-power SoC that combines the capabilities of a regular MCU, the Fabric Controller (FC) domain, with a fully programmable parallel accelerator called Cluster (CL) that can yield up to 5.4 GOps/s [16] in a power envelope of ∼100 mW. The FC domain includes a low-power processor implementing the RISC-V RV32IMCXpulpv2 instruction set architecture (ISA) based on the RI5CY design, paired with on-chip SRAM memory and peripherals supporting protocols like SPI, I2C, HyperBus, and Camera Parallel Interface (CPI). These can be accessed through a dedicated Direct Memory Access (DMA) engine called µDMA to offload the communication burden from the FC.
The CL accelerates heavier parallelizable workloads typical of digital signal processing and AI applications. It includes 8 RISC-V cores with the same ISA extensions as the FC; in particular, the Xpulpv2 extension includes 8-bit and 16-bit packed single instruction multiple data (SIMD), multiply-accumulate, and dot-product operations, which enhance the SoC's linear algebra capabilities for such workloads. The cores are connected over a logarithmic interconnect to 64 kB of L1 Tightly Coupled Data Memory (TCDM) comprising 16 memory banks. The logarithmic interconnect assures 1-cycle latency access for all the cores when there is no bank conflict, enabling fast data parallelism among the cores. The cluster has a DMA to offload data transfers between the TCDM and the 512 kB L2 memory. The cores are programmed with the Single-Program Multiple-Data programming model and synchronized using dedicated hardware for low-latency barriers.
The FC and cluster domains are separately clocked to tune for the best trade-off between energy efficiency and performance. The FC domain can be clocked between 50 MHz and 250 MHz, while the cluster domain can be clocked between 100 MHz and 175 MHz. The SoC voltage (Vdd) can be set between 1 V and 1.2 V depending on the FC and cluster clock frequency.

C. Automatic Deployment Tools
Developing CNNs for an MCU-class processing device, like the GAP8 on our nano-drone, is a multi-objective optimization problem. In our case, it must take into account: i) memory limitations (L2 512 kB and L1 64 kB), ii) hardware limitations (i.e., no floating point unit), iii) power envelope (∼100 mW), and iv) throughput (i.e., a flying drone needs to react in real-time). As for PULP-Dronet v2 [5], we use an automated flow that works in two steps: first, we quantize the neural network [23], and then we perform a hardware-aware deployment of the quantized model.
For quantization, we take float32 pre-trained CNN models, and we apply post-training quantization using NEMO [24], a deep neural network (DNN) quantization tool that performs uniform asymmetric static layer-wise quantization. We apply fixed-point 8-bit (int8) post-training quantization to the weights and activations of our CNNs. By doing so, we enable optimized 8-bit fixed-point arithmetic on GAP8, i.e., packed-SIMD instructions [16]. Moreover, the conversion from the float32 data type to the quantized int8 one reduces the memory footprint of the CNN models by 4×.
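As an illustration of what uniform asymmetric layer-wise quantization means, the following minimal sketch maps the value range of one tensor onto 8-bit integers with a single scale and zero-point. It is a generic textbook formulation written for this explanation; the function names are hypothetical and it does not reproduce NEMO's API or its internal procedure.

```python
def quant_params(x_min, x_max, n_bits=8):
    """Scale and zero-point mapping [x_min, x_max] onto [0, 2**n_bits - 1]."""
    q_max = 2 ** n_bits - 1
    # Extend the range so that 0.0 is always exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / q_max or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(-x_min / scale)      # integer code that represents 0.0
    return scale, zero_point

def quantize(values, scale, zero_point, n_bits=8):
    """Round to the nearest integer code and clip to the n_bits range."""
    q_max = 2 ** n_bits - 1
    return [min(q_max, max(0, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Map integer codes back to approximate real values."""
    return [(q - zero_point) * scale for q in q_values]

# One scale/zero-point pair per layer ("layer-wise"), fixed offline ("static").
weights = [-0.5, 0.0, 0.25, 1.0]
scale, zp = quant_params(min(weights), max(weights))
codes = quantize(weights, scale, zp)
restored = dequantize(codes, scale, zp)
```

Each code occupies one byte instead of the four bytes of a float32 value, which is where the 4× memory reduction comes from.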
For hardware-aware deployment, we use DORY [25], a SoA code generation tool for quantized DNNs. It is tuned for deployment on memory-constrained embedded devices with multiple levels of memory hierarchy to optimize performance. To fit the data in the available memory resources, large operations need to be split into smaller pieces called tiles. DORY models the tile size optimization problem as a constraint programming problem with the memory size as the main constraint and hardware-aware heuristics to guide the ILP solver toward the best-performing solutions.
While tiling makes the operations fit into our desired memory, it still requires data marshaling between the memory levels to process the whole layer. DORY hides this data movement cost by overlapping it with computation through multi-buffering and software pipelining. For efficient operation computation, DORY exploits the PULP-NN [16] open-source library, which offers hand-optimized kernels for quantized neural networks executing on the PULP cluster.
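To make the tiling constraint concrete, the sketch below searches for the largest number of output rows of a convolutional layer whose int8 buffers fit in the 64 kB L1 memory when every buffer is duplicated for double buffering. It is a deliberately simplified brute-force illustration: the function name, the row-only tiling, the absence of padding, and the example layer shape are all assumptions made here, and DORY's actual constraint-programming formulation is richer than this.

```python
def largest_row_tile(h_out, w_out, c_in, c_out, k=3, stride=1,
                     l1_bytes=64 * 1024, buffers=2):
    """Largest number of output rows per tile whose int8 buffers fit in L1.

    Each tile needs its input rows (including the k-row halo), its output
    rows, and the full weight set. Everything is replicated `buffers` times
    so the DMA can fill one copy while the cores compute on the other.
    """
    w_in = (w_out - 1) * stride + k          # input width without padding
    weight_bytes = k * k * c_in * c_out      # one byte per int8 weight
    for rows in range(h_out, 0, -1):
        in_rows = (rows - 1) * stride + k
        tile_bytes = in_rows * w_in * c_in + rows * w_out * c_out + weight_bytes
        if buffers * tile_bytes <= l1_bytes:
            return rows
    return None  # not even one row fits: channels would need tiling as well

# Hypothetical layer: 25x25 output, 32 input and 32 output channels, 3x3 kernel.
rows_per_tile = largest_row_tile(h_out=25, w_out=25, c_in=32, c_out=32)
```

For this hypothetical layer, the whole 25-row output does not fit at once, so the layer would be processed in several row tiles whose transfers can be overlapped with computation.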

D. The baseline PULP-Dronet v2 CNN
PULP-Dronet v2 [5] is an end-to-end vision-based CNN for autonomous navigation aboard nano-drones and suited for deployment on the AI-deck. We employ PULP-Dronet v2 as a baseline model for comparison with our work. The architecture of this CNN is represented in Figure 3, and it is based on a sequence of three residual blocks [26], which we denote as ResBlock (RB). Each block consists of two parallel branches: the main branch performs two 3×3 convolutions, while the bypass one performs one 1×1 convolution. The number of output channels is 32, 64, and 128 for the three blocks, respectively. For each block, the last convolution on both branches applies a stride factor of 2 to halve the feature map size. Following each convolution is a Batch Normalization (BN) layer and a rectified linear unit (ReLU) activation function. The ReLU output is capped to the value 6, therefore called ReLU6.
The CNN processes a grayscale image of size 200 × 200. This image is the bottom-center crop extracted from the QVGA image captured by the AI-deck's Himax camera. The final layer produces two distinct outputs: a collision probability (classification problem) and a steering angle (regression problem). Consequently, the model is trained using two distinct metrics: the mean squared error (MSE) and the binary cross-entropy (BCE). These metrics are combined into a unified loss function, Loss = MSE + β · BCE, which we also employ in this work. β is set to 0 for the first 10 epochs and then gradually increased in a logarithmic way, so that the early training prioritizes the regression problem. The training employs a dynamic negative hard-mining approach, gradually focusing the loss computation on the top-k samples exhibiting the highest error. PULP-Dronet v2 is deployed in fixed-point 8-bit arithmetic on the multi-core GAP8 SoC, yielding 41 MMAC operations per frame and 320 kB of weights.
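The combined loss with hard mining can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the function name is hypothetical, the top-k selection is applied separately to the regression and classification errors, and β and k are passed in as plain arguments because their exact schedules are not reproduced here.

```python
import math

def dronet_style_loss(steer_pred, steer_true, coll_prob, coll_true, beta, k):
    """Loss = MSE + beta * BCE, each averaged over the k hardest samples."""
    # Per-sample squared error for the steering regression.
    sq_err = [(p - t) ** 2 for p, t in zip(steer_pred, steer_true)]
    # Per-sample binary cross-entropy for the collision classification.
    eps = 1e-7  # avoids log(0)
    ce = [-(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps)))
          for p, t in zip(coll_prob, coll_true)]
    # Hard mining: keep only the k samples with the highest error (k <= batch size).
    hardest = lambda errs: sorted(errs, reverse=True)[:k]
    mse = sum(hardest(sq_err)) / k
    bce = sum(hardest(ce)) / k
    return mse + beta * bce

# Toy batch of three samples.
loss_regression_only = dronet_style_loss(
    [0.0, 0.5, 1.0], [0.0, 0.0, 0.0], [0.5, 0.5, 0.5], [1, 0, 1], beta=0.0, k=2)
loss_joint = dronet_style_loss(
    [0.0, 0.5, 1.0], [0.0, 0.0, 0.0], [0.5, 0.5, 0.5], [1, 0, 1], beta=1.0, k=2)
```

With β = 0 only the steering error contributes, which mirrors the initial training phase described above; increasing β brings the collision term into play.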

A. Dataset collection framework
We developed a software framework for dataset collection, outlined in Figure 2-A. It enables a pilot to manually control the nano-drone and simultaneously log: i) any data from the STM32 flight controller (e.g., the drone's state estimation) and sensors attached to it (e.g., additional expansion decks of the Crazyflie 2.1, such as ranging sensors), ii) images captured by the AI-deck's front-facing camera. All data is recorded on a PC base station, which independently receives the two data streams: the STM32 packets, transmitted via radio, and AI-deck images, transmitted via Wi-Fi. To ensure the data from these two separate streams can be accurately matched in post-processing, we i) periodically synchronize both the STM32 and GAP8 MCUs with a global clock, generated from a GPIO of GAP8, and ii) send each chunk of data associated with a timestamp that is globally synchronized across our robotic platform. This synchronization methodology minimizes matching errors by ensuring each packet is timestamped at its source, thus avoiding errors related to wireless communication and leaving only minimal discrepancies caused by the MCUs' oscillator drifts.
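Once both streams carry timestamps from the same global clock, matching them in post-processing reduces to a nearest-timestamp search. The sketch below shows one plausible way to do this; the function name, the `max_gap` tolerance, and the list-based data layout are assumptions for illustration and are not taken from the released framework.

```python
from bisect import bisect_left

def match_streams(image_ts, state_ts, max_gap):
    """For each image timestamp, index of the closest state packet (or None).

    Both lists hold timestamps from the shared global clock, in the same
    unit, and `state_ts` must be sorted in increasing order.
    """
    matches = []
    for t in image_ts:
        i = bisect_left(state_ts, t)
        # The closest packet is either just before or just after position i.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(state_ts)]
        best = min(candidates, key=lambda j: abs(state_ts[j] - t), default=None)
        # Reject matches that are too far apart to be trusted.
        if best is not None and abs(state_ts[best] - t) > max_gap:
            best = None
        matches.append(best)
    return matches

# Toy example: state packets every 10 ticks, images at arbitrary ticks.
pairs = match_streams(image_ts=[1, 14, 16, 100], state_ts=[0, 10, 20, 30], max_gap=5)
```

Images with no state packet within the tolerance are left unmatched rather than paired with stale data.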
Our dataset collector framework consists of: i) code for the STM32 and GAP8 SoC (Figure 2-A), which enables sending the data streams while keeping the global clock synchronized, and ii) a graphical user interface (GUI) (Figure 2-B) to plot the data streamed from the flight controller and visualize images collected from the AI-deck.

B. Our dataset for nano-drone autonomous navigation
We introduce a new dataset for visual-based autonomous nano-drone navigation. We move on from the limitations of the past PULP-Dronet dataset [5], which was created by assembling different sets of data, each one having images labeled either for the steering or the collision avoidance task, as explained in Section II-A, ultimately penalizing the static obstacle avoidance [5]. To tackle this limitation, we collect a new unified dataset from scratch.
With the dataset collector framework introduced in Section IV-A, we collected a dataset of 66 k images for nano-drones' autonomous navigation, for a total of ∼600 MB of data. We used the Bitcraze Crazyflie 2.1, described in Section III-A. A human pilot manually flew the drone, collecting i) images from the grayscale QVGA Himax camera sensor of the AI-deck, ii) the gamepad's yaw-rate input from the human pilot, normalized in the [−1; +1] range, iii) the drone's estimated state, and iv) the distance between obstacles and the drone measured by the front-looking ToF sensor.
After the data collection, we assigned every image a binary collision label, set to 1 whenever an obstacle was in the line of sight and closer than 2 m. We recorded 301 sequences in 20 different environments. Each sequence of data is labeled with high-level characteristics: scenario (i.e., indoor or outdoor), path type (i.e., presence or absence of turns), obstacle types (e.g., pedestrians, chairs), flight height (i.e., 0.5 m, 1 m, 1.5 m), light conditions (dark, normal, bright), acquisition date, and a location name identifier.
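The collision labeling rule can be expressed as a one-line threshold on the front-looking ToF reading. The sketch below is only illustrative: the function name is hypothetical, and using `None` to represent a missing or out-of-range reading is an assumption about the log encoding.

```python
COLLISION_RANGE_M = 2.0  # threshold stated in the text

def collision_label(front_tof_m):
    """Return 1 if the front ToF sensor sees an obstacle closer than 2 m, else 0.

    `None` stands for "no valid reading" (nothing in the line of sight); how
    the real logs encode that case is an assumption of this sketch.
    """
    if front_tof_m is None:
        return 0
    return int(front_tof_m < COLLISION_RANGE_M)

labels = [collision_label(d) for d in (0.4, 1.99, 2.0, 3.5, None)]
```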
For training and validating our CNNs, we post-processed the dataset as follows. We used 70%, 10%, and 20% of the images as training, validation, and testing sets, respectively. We augmented the training images by applying random flipping, brightness augmentation, vignetting, and blur. The resulting dataset has ∼124 k images, split as follows: 109 k, 5 k, and 10 k images for training, validation, and testing, respectively. Our testing dataset was biased towards the center of the [−1; +1] yaw-rate range: images associated with a yaw-rate of 0, indicating no input from the human pilot and thus the drone flying straight, were over-represented. To address this bias, we balanced the dataset by selectively removing a portion of images that had a yaw-rate of 0.
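The balancing step can be sketched as random subsampling of the zero yaw-rate images. The fraction of zero-yaw images that is kept is not stated in the text, so `keep_fraction` below is a hypothetical parameter, as are the function name and the `(image_id, yaw_rate)` data layout.

```python
import random

def drop_zero_yaw(samples, keep_fraction, seed=0):
    """Keep every sample with non-zero yaw-rate and a random share of the rest.

    `samples` is a list of (image_id, yaw_rate) pairs with yaw_rate in [-1, +1].
    `keep_fraction` is the probability of keeping a zero yaw-rate sample.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return [s for s in samples if s[1] != 0.0 or rng.random() < keep_fraction]

# Toy data: 100 "flying straight" frames and 20 frames with a steering input.
toy = [(i, 0.0) for i in range(100)] + [(100 + i, 0.5) for i in range(20)]
balanced = drop_zero_yaw(toy, keep_fraction=0.25)
```

All frames with a steering input survive, while only a random subset of the straight-flight frames is retained, flattening the peak at zero in the label distribution.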
The final distribution of our balanced dataset is represented in Figure 4. In the same figure, we also report the RMSE and Accuracy for three trivial predictors, i.e., a predictor that always predicts a collision of 1, a predictor that always predicts a collision of 0, and a predictor that always predicts a yaw-rate of zero (i.e., go straight). These values can be later used in Section VI as a baseline to assess the RMSE and Accuracy performance of our trained CNNs.
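The three trivial predictors are straightforward to evaluate on any labeled set. The following sketch shows how such baselines could be computed; the function names and the toy labels are made up for illustration and do not correspond to the values reported in Figure 4.

```python
import math

def rmse(pred, true):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def accuracy(pred, true):
    """Fraction of exactly matching binary labels."""
    return sum(int(p == t) for p, t in zip(pred, true)) / len(true)

def trivial_baselines(yaw_rates, collisions):
    """Scores of the constant predictors described in the text."""
    n = len(yaw_rates)
    return {
        "rmse_always_straight": rmse([0.0] * n, yaw_rates),
        "acc_always_collision": accuracy([1] * n, collisions),
        "acc_never_collision": accuracy([0] * n, collisions),
    }

# Toy labels, not taken from the dataset.
scores = trivial_baselines(yaw_rates=[-1.0, 0.0, 0.0, 1.0], collisions=[1, 0, 0, 0])
```

A trained CNN is only useful if it beats these constant predictors on both metrics.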

V. NEURAL NETWORK ARCHITECTURE DESIGN
The original PULP-Dronet v2 CNN's architecture [5] exploits three ResBlocks (RB) [26], as described in Section III-D. In other vision tasks, the RB architecture has been largely replaced by other kinds of CNN topologies, which provide similar accuracy characteristics but with lower workload and memory footprint [3]. Therefore, we design the new PULP-Dronet v3 architecture by exploring several modifications to its baseline topology.
Our modifications, detailed in Figure 3, involve substituting the three RB blocks of PULP-Dronet v2, removing parallel by-pass branches, and varying the number of channels in the intermediate feature maps. These changes aim to optimize autonomous navigation and to free resources for more complex onboard functionality. First, we replace the RB blocks with two other blocks we designed, taking inspiration from the well-known Mobilenet family of CNNs [17], [27]. These CNNs have been shown to be both accurate for visual AI tasks and suitable for efficient MCU deployment [3]. For each block we substitute, we keep the same output feature map dimensions (width and height) of PULP-Dronet v2.
The first new block we design is based on Mobilenet v1 [27]. It uses separable depthwise and pointwise (D+P) convolutional layers instead of traditional ones. Such layers factorize a standard convolution into a sequence of a K×K depthwise convolution and a 1×1 convolution called pointwise convolution, where K is the kernel size. We define our D+P block as consisting of two branches: the main branch performs two 3 × 3 D+P convolutions, while the parallel by-pass branch executes a standard 1 × 1 convolution. We apply convolutional strides different from 1 only to the last convolution in both branches. Each convolution (depthwise, pointwise, and standard) is followed by a BN layer and a ReLU6. We refer to the sequence of these three layers as a Conv-BN-ReLU. This Conv-BN-ReLU pattern simplifies both the quantization and the deployment explained in Section VI-D: the tensor produced by the Conv operation demands a finer-grained representation (32 bit) than the inputs and weights. The BN therefore operates on the 32-bit accumulated representation, while the ReLU immediately reduces it back to 8 bit.
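The saving obtained by factorizing a standard convolution into a D+P pair can be quantified by counting weights. The sketch below uses the standard formulas for the two layer types; the function names are hypothetical, biases and BN parameters are ignored, and the 32-to-64-channel example is chosen only for illustration.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out

def dp_conv_params(k, c_in, c_out):
    """Weights of a depthwise (k x k) plus pointwise (1 x 1) convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise

# Example: a 3x3 layer going from 32 to 64 channels.
std = standard_conv_params(3, 32, 64)
dp = dp_conv_params(3, 32, 64)
print(std, dp, round(std / dp, 1))
```

Since both layer types are applied at every output position, the MAC count shrinks by the same ratio as the weight count.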
The second block we define is inspired by Mobilenet v2 [17], using inverted residuals and linear bottlenecks (IRLB) layers. This block comprises i) a 1 × 1 convolution, expanding the number of channels by an expansion factor (E), ii) a D+P convolution, and iii) a 1×1 projection convolution that inverts the expansion, reducing the number of output channels. Stride factors different from 1 are applied in the D+P convolution. We choose an expansion ratio equal to 6 as in [17]. Additionally, we add a by-pass branch performing a 1 × 1 convolution, ensuring the same output size as the main IRLB branch.
We explore additional architecture variations by considering the removal of the parallel by-pass branches from each block type. These branches primarily mitigate vanishing gradient effects in deep CNNs [26]. However, recent studies [18] have highlighted their inefficacy in shallow CNN models, such as the seven-layer PULP-Dronet. Lastly, we vary the number of channels of the intermediate CNN feature maps to investigate the tradeoffs among accuracy, memory footprint, and computational cost. As in [18], [27], we thin the CNNs' tensors by applying a dividing factor γ to the number of channels across all convolutional layers. We sweep γ over the values {1, 2, 4, 8}, where γ = 1 corresponds to the baseline size of PULP-Dronet v2 [5].
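The effect of the dividing factor γ can be sketched in a few lines of Python. The baseline channel widths below are assumed for illustration only (they are not read from Figure 3), and `thin_channels` and `pointwise_weights` are hypothetical helper names. The sketch shows that the weights of the 1 × 1 (pointwise) layers shrink roughly with γ², since both their input and output channels are divided by γ.

```python
def thin_channels(base_channels, gamma):
    """Divide every channel count by gamma, keeping at least one channel."""
    return [max(1, c // gamma) for c in base_channels]


def pointwise_weights(channels):
    """Total weights of 1 x 1 convolutions chaining consecutive feature maps."""
    return sum(c_in * c_out for c_in, c_out in zip(channels, channels[1:]))


# Assumed baseline widths of the intermediate feature maps (illustrative only).
BASE_CHANNELS = [32, 32, 64, 128]

results = {}
for gamma in (1, 2, 4, 8):
    channels = thin_channels(BASE_CHANNELS, gamma)
    results[gamma] = (channels, pointwise_weights(channels))
    print(f"gamma={gamma}: channels={channels}, pointwise weights={results[gamma][1]}")
```

Depthwise layers scale only linearly with γ, so the overall parameter reduction of a real network falls between 1/γ and 1/γ².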

A. Datasets evaluation
In Table II, we assess the impact of training and testing on our new dataset (v3) versus the Himax+Original PULP-Dronet v2 dataset used in the SoA work [5] (v2); each dataset is split into separate training and testing sets. We perform this assessment by keeping the original PULP-Dronet v2 CNN architecture in both cases. When the CNN is trained on v2 and tested on v2, we achieve performance consistent with the results reported in [5]. When training on v3 and testing on v3, both metrics worsen compared to the previous case due to the higher complexity of the v3 testing set, which has a more uniform distribution of steering labels and richer, and therefore more challenging, content. We also report the cross-validation results of training on v2 and testing on v3, and of training on v3 and testing on v2. In both cases, performance drops compared to training and testing on the same dataset version, but the v2-trained CNN (tested on v3) shows a larger degradation in both RMSE and accuracy than the v3-trained CNN (tested on v2), suggesting better generalization capabilities for the proposed dataset.

B. CNN architectures exploration
To evaluate the PULP-Dronet v3 family of CNNs proposed in Section V, we start by analyzing different block types (RB, D+P, and IRLB) and the effect of removing the by-pass branches. In Table III, we assess the CNNs in terms of parameter count, MAC operations, and key performance metrics: the accuracy for the classification problem (the higher, the better) and the RMSE for the regression one (the lower, the better).
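For reference, the two metrics can be computed as in the minimal Python sketch below. The 0.5 decision threshold on the collision probability and the toy values are assumptions made for illustration; they are not taken from our evaluation scripts.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error of the regression (steering) output."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(probabilities, labels, threshold=0.5):
    """Share of samples whose thresholded collision probability matches the label."""
    hits = sum((p >= threshold) == bool(l) for p, l in zip(probabilities, labels))
    return hits / len(labels)


# Toy values, for illustration only.
steering_rmse = rmse([0.1, -0.2, 0.4], [0.0, -0.5, 0.4])
collision_acc = accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
print(f"RMSE={steering_rmse:.3f}, accuracy={collision_acc:.2f}")
```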
Removing the by-pass branches negligibly affects the CNN performance metrics in all cases, i.e., by at most +0.005 in RMSE and +1% in accuracy, with the IRLB CNN. On the other hand, the by-pass removal brings the benefit of reducing the CNN size by 3-11% and the number of MACs by 3-15%, depending on the architecture. Focusing on the three CNNs without by-pass, the RB variant achieves the lowest RMSE of 0.339, whereas the D+P and IRLB models score 0.350 and 0.369, respectively. Instead, the D+P neural network achieves the highest accuracy score of 84%. The D+P CNN is also the fastest CNN, requiring only 12 MMAC per inference, and the smallest in memory usage, being 6.2× smaller than RB and 2.5× smaller than IRLB. Ultimately, we select the D+P model without by-pass branches for its efficiency and minimal performance drop.

C. CNNs size analysis
In this section, we analyze how the number of channels across our CNNs affects their memory requirements, number of operations, and regression/classification performance. Starting from the D+P CNN architecture without by-passes, selected in Section VI-B, we sweep the γ parameter (see Section V) over the values {1, 2, 4, 8}, where γ = 1 corresponds to the baseline size of PULP-Dronet v2 [5]. Table IV shows that using a dividing factor γ = 2 does not affect the classification accuracy of the CNN when compared to γ = 1, both scoring 84%. When using smaller CNNs, the classification accuracy drops by 3% with γ = 4 and by 7% with γ = 8.
On the regression task, the RMSE gradually increases as the CNNs become smaller. For γ = 1, the RMSE stands at 0.350, but it increases to 0.367, 0.373, and 0.379 for γ = 2, γ = 4, and γ = 8, respectively. On the other hand, the γ factor greatly impacts the number of the CNN's parameters. The biggest CNN (γ = 1) has 204 k parameters, while the CNN with γ = 2 has 3× fewer parameters (69 k parameters). Increasing γ to 4 leads to a model with 26 k parameters, and finally the smallest CNN (γ = 8), which we call Tiny-PULP-Dronet v3, has only 2.9 k parameters.

D. Quantization and deployment
In this section, we proceed with quantizing and deploying the four D+P models without by-passes, introduced in Section VI-C, to evaluate their trade-off between accuracy/RMSE and onboard execution efficiency. We apply 8-bit post-training quantization, as detailed in Section VI-D. Table V outlines the quantized models' regression/classification performance, compared to the non-quantized models in Table IV. Quantization introduces only a 2% reduction in accuracy on the CNN employing γ = 2, while the other CNNs keep the same score as with 32-bit precision. Instead, the RMSE is more sensitive to the new int8 data type, with the error slightly increasing by 0.011, 0.006, 0.005, and 0.009 for γ = 1, 2, 4, 8, respectively. Compared to the original float32 models (Table IV), we reduce the memory footprint of each CNN by 4×.
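As a generic illustration of 8-bit post-training quantization (not the specific toolchain used in this work), the Python sketch below applies symmetric per-tensor quantization to a few example weights and shows why int8 storage is 4× smaller than float32. The helper names, the weight values, and the 1000-parameter layer are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the int8 range."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


def footprint_bytes(n_params, bits_per_param):
    """Storage needed for the weights alone."""
    return n_params * bits_per_param // 8


weights = [0.42, -1.0, 0.25, 0.0]          # hypothetical float32 weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Hypothetical layer with 1000 parameters: float32 versus int8 storage.
ratio = footprint_bytes(1000, 32) / footprint_bytes(1000, 8)
print(codes, f"max error={max_error:.4f}", f"footprint ratio={ratio:.0f}x")
```

The rounding error of each weight is bounded by half the quantization step, which is consistent with the small RMSE increase observed after quantization.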
Then, we deploy these four quantized CNNs on the GAP8 SoC to analyze their on-device performance. Table VI shows their inference rate (frame/s) when GAP8 is running at its max performance (mp) configuration, i.e., FC@250 MHz, CL@175 MHz, and V_dd@1.2 V. Our largest CNN model (γ = 1) achieves a throughput of 34 frame/s, which is 1.8× higher than the SoA PULP-Dronet v2, peaking at 19 frame/s, despite having the same number of channels across the CNN architecture; this improvement derives from our architecture modifications. The other values of γ result in 61 frame/s with 14 kB, 101 frame/s with 4.7 kB, and 139 frame/s with 2.9 kB for γ = 2, 4, and 8, respectively. Our smallest model (γ = 8), called Tiny-PULP-Dronet v3, improves the throughput by 7.3× and reduces the memory footprint by 168× compared to PULP-Dronet v2.
Last, we measure the power consumption of all CNNs under two SoC configurations: i) the max performance (mp) setting, and ii) the energy-efficient (ee) one, which operates at FC@50 MHz, CL@100 MHz, and V_dd@1.0 V. In the ee configuration, the CNNs with γ ∈ {1, 2} consume 38 mW, while the CNNs with γ ∈ {4, 8} show an average power consumption of 34 mW. On the other hand, in the mp setting all CNNs average a power consumption of 100 mW. As shown in Table VI, the energy for one-frame inference with the ee configuration (E_ee) is 2.1, 1.1, 0.6, and 0.4 mJ for γ = 1, 2, 4, 8, respectively. The one-frame inference energy needed for the mp configuration (E_mp) is always ∼1.5× higher.
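The relation between power, throughput, and energy per frame can be checked with a short Python sketch. It only uses the figures quoted in the text (mp throughput, the 100 mW average mp power, and the E_ee values); the resulting E_mp values are back-of-the-envelope estimates from rounded numbers, so Table VI remains the reference.

```python
# Figures quoted in the text (mp = max performance, ee = energy efficient).
FPS_MP = {1: 34, 2: 61, 4: 101, 8: 139}       # frame/s per gamma
POWER_MP_MW = 100.0                           # average power in mp, in mW
E_EE_MJ = {1: 2.1, 2: 1.1, 4: 0.6, 8: 0.4}    # energy per frame in ee, in mJ


def energy_per_frame_mj(power_mw: float, fps: float) -> float:
    """mW divided by frame/s gives mJ per frame."""
    return power_mw / fps


E_MP_MJ = {g: energy_per_frame_mj(POWER_MP_MW, fps) for g, fps in FPS_MP.items()}
RATIO = {g: E_MP_MJ[g] / E_EE_MJ[g] for g in FPS_MP}

for g in sorted(FPS_MP):
    print(f"gamma={g}: E_mp ~ {E_MP_MJ[g]:.2f} mJ, E_mp/E_ee ~ {RATIO[g]:.2f}")
```

With these rounded inputs, the estimated ratio falls between roughly 1.4 and 1.8, in line with the ∼1.5× figure reported above.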

VII. IN-FIELD TESTING
We evaluate the navigation capabilities of our PULP-Dronet v3 and Tiny-PULP-Dronet v3 CNNs with in-field experiments. We record all experiments in a flying room equipped with a Qualisys motion capture system (24 cameras) providing mm-precise tracking of our drone. We track the position and pose of our drone at 100 Hz to analyze the drone's flight in post-processing. We investigate whether the new dataset we collected, which has unified labels for yaw rate and collision probability, improves the navigation capabilities over the SoA PULP-Dronet v2, which was trained with disjoint steering and collision probability labels and therefore struggles to avoid static obstacles, as described in [5].
We devised a challenging navigation scenario involving a U-shaped corridor, illustrated in Figure 5, and divided it into three segments. Segments S1 and S3 are straight paths, each one presenting two obstacles, whereas segment S2 features a 180° turn. The 2 m wide corridor has 1 m wide obstacles that lean on opposite walls, blocking straight pathways. We designed two experiments: i) one where all obstacles are static, and ii) one where the second obstacle is dynamic. In the experiment with a dynamic obstacle, obstacle 2 appears in the center of the lane as soon as the drone passes the first obstacle, providing 1.5 m of braking space, and it is removed 5 s later.
These scenarios aim at testing the nano-UAV's capabilities on multiple skills: i) avoiding both static and dynamic obstacles, ii) navigating through a narrow environment (a corridor), and iii) taking a sharp 180° turn. The corridor environment we use in our tests has never been seen before by either PULP-Dronet v2 or PULP-Dronet v3, as it is not part of either CNN's training set.
We must stress that we designed these tests to challenge PULP-Dronet v2, focusing on its weaknesses specifically. While the CNN has demonstrated excellent navigation capabilities in corridor turns and dynamic obstacle avoidance, the original tests revealed that it struggles with static obstacles in narrow tunnels [5].
We evaluate our closed-loop system across three target speeds: v_target = 0.5, 1.0, and 1.5 m/s. We set a target flight altitude of 0.5 m in every test. We conduct five experiments for each combination of CNN and v_target for statistical relevance (90 tests in total), and we compare them with the SoA PULP-Dronet v2 [5]. For the sake of comparability, we always use the same drone control state machine as PULP-Dronet v2 [5], including a first-order low-pass filter on both the forward speed of the drone, v_unfilt (Eq. 1), and the yaw rate, ω_unfilt (Eq. 2). The filtered values for the forward velocity (v_filter) and the yaw rate (ω_filter) are fed to the drone's flight controller. We set the low-pass filtering parameters α_1 = α_2 = 0.3. We use the stock hardware from Bitcraze, including i) stock motors (max current of 1 A), ii) a 250 mAh battery, and iii) stock propellers with a 45 mm diameter, for a total weight of 35 g.
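The behavior of a first-order low-pass filter with α = 0.3 can be illustrated with the Python sketch below. It assumes the common recursive form filtered_k = (1 − α) · filtered_(k−1) + α · unfiltered_k, which may differ in detail from Eq. 1 and Eq. 2; the function names, the zero initial state, and the constant 0.5 m/s input are hypothetical.

```python
def low_pass(previous: float, new_sample: float, alpha: float) -> float:
    """One step of a first-order low-pass filter (assumed recursive form)."""
    return (1.0 - alpha) * previous + alpha * new_sample


def filter_sequence(samples, alpha, initial=0.0):
    """Filter a whole sequence, returning the filtered value after each sample."""
    filtered, state = [], initial
    for sample in samples:
        state = low_pass(state, sample, alpha)
        filtered.append(state)
    return filtered


ALPHA_1 = 0.3  # filter parameter stated in the text

# Hypothetical unfiltered forward-speed command held constant at 0.5 m/s.
v_filter = filter_sequence([0.5, 0.5, 0.5], ALPHA_1)
print([round(v, 4) for v in v_filter])
```

Under this assumed form, a step in the unfiltered command reaches the flight controller gradually, which smooths abrupt changes in the CNN outputs between consecutive frames.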
The videos of the experiments are accessible through the link provided in the supplementary material section.

A. U-shaped path with static obstacles
We conduct the first set of experiments with all static obstacles. Table VII outlines the success rate for each CNN tested along with the average speed (v_avg) of the drone. If the drone fails to complete the entire path, v_avg is noted as N/A. If it fails to complete a segment, we do not start the drone again on subsequent segments, and their success rates are noted as "-" when they are never attempted. PULP-Dronet v2 never succeeds in passing segment S1 of the path at any target speed. The two obstacles of segment S1 visually obstruct the entire corridor's width, resulting in a consistently high collision probability, while the CNN's steering output keeps the drone in the center of the lane defined by the surrounding walls. As a result, the drone either does not move forward due to the CNN's high collision probability, or it slowly drifts against obstacle 1, ultimately crashing. This outcome aligns with a similar scenario described in [5], where PULP-Dronet v2 encountered a 100% failure rate in tackling a narrow tunnel scenario with static obstacles.
Moving to our CNNs, we plot the trajectories of PULP-Dronet v3 in Figure 6-A-B-C, and the trajectories of Tiny-PULP-Dronet v3 in Figure 6-D-E-F. First, we analyze the results with a target speed of 0.5 m/s. PULP-Dronet v3 and Tiny-PULP-Dronet v3 fly through the whole U-shaped path with a single failure and no failures, respectively. The only time that PULP-Dronet v3 fails, it crashes against obstacle 4, still successfully completing 66% of the corridor.
Analyzing the 1 m/s target speed configuration, PULP-Dronet v3 successfully navigates the entire corridor three times out of five.
In conclusion, the static obstacle avoidance success rate of PULP-Dronet v3 and Tiny-PULP-Dronet v3 decreases as the target flight speed increases. While PULP-Dronet v3 and Tiny-PULP-Dronet v3 show 80% and 100% success rates at 0.5 m/s, their success rates over the whole path decrease at higher speeds. PULP-Dronet v3 reduces its success rate to 60% at 1 m/s, while Tiny-PULP-Dronet v3 and PULP-Dronet v3 never succeed at 1 m/s and 1.5 m/s, respectively. These results mark an improvement with respect to PULP-Dronet v2, showing that i) the dataset we collected with joint labels successfully trains CNNs that can tackle both obstacle avoidance and steering tasks together, and ii) our Tiny-PULP-Dronet v3, with only 2.9 k parameters, can enable static obstacle avoidance while being 168× smaller than the SoA model.

B. U-shaped path with static and dynamic obstacles
We conduct the second set of experiments stressing dynamic obstacle avoidance. We use the same setup as Section VII-A, but now obstacle 2 appears in the center of the corridor after the drone passes obstacle 1. Table VIII outlines the success rate of the three CNNs tested. Starting from the SoA CNN, PULP-Dronet v2 never succeeds in navigating any of the three segments (S1, S2, S3) of our path for the same reason described in Section VII-A, getting stuck in front of obstacle 1 and eventually crashing into it.
Moving to our CNNs, we plot the trajectories of PULP-Dronet v3 in Figure 7-A-B-C, and the trajectories of Tiny-PULP-Dronet v3 in Figure 7-D-E-F. PULP-Dronet v3 shows the highest success rate among all the CNNs tested. It completes the whole path with a success rate of 60%, 100%, and 40% when flying at 0.5, 1.0, and 1.5 m/s, respectively. Remarkably, PULP-Dronet v3 only crashed once against the dynamic obstacle while flying at the highest speed (1.5 m/s), always succeeding in passing the S1 segment in all other cases. On the other hand, Tiny-PULP-Dronet v3 completes the whole corridor with a 60% success rate when flying at both 0.5 m/s and 1 m/s. However, it always fails at the higher speed of 1.5 m/s, either crashing on the right wall of segment S1 or failing to complete the turn in S2. Nevertheless, Tiny-PULP-Dronet v3 avoided the collision against the dynamic obstacle of segment S1 two times when flying at the highest speed.
In conclusion, our experiments demonstrate that for PULP-Dronet v3 and Tiny-PULP-Dronet v3, the dynamic obstacle avoidance rate decreases once the drone's flight speed exceeds 1 m/s. Both CNNs reliably avoid collisions with the dynamic obstacle in segment S1 100% of the time at speeds up to 1 m/s. However, at the highest speed of 1.5 m/s, the success rates decrease to 80% for PULP-Dronet v3 and 40% for Tiny-PULP-Dronet v3.

VIII. CONCLUSION
Nano-sized UAVs are ideal candidates as ubiquitous flying IoT nodes. In this paper, we distill a novel family of CNNs for autonomous navigation on nano-drones, i.e., the Tiny-PULP-Dronet v3 CNNs. Compared to the SoA, our models reduce memory footprint by up to 168× (down to 2.9 kB) and achieve an inference rate of up to 139 frame/s. We create a new open-source 66 k image dataset for autonomous nano-UAV navigation. We compare both the SoA and our CNNs with in-field tests on a COTS nano-UAV. Our tiny CNN succeeds in navigating a challenging path with static and dynamic obstacles and a 180° turn at speeds of up to 1 m/s. In contrast, the baseline consistently fails despite having 168× more parameters.
For future development, we aim to exploit the compute and memory budget we have freed with Tiny-PULP-Dronet v3 by introducing additional tasks, e.g., executing multiple CNNs onboard [18]. Additionally, efficiency can be further improved by implementing a more aggressive quantization scheme [23], e.g., 4-bit or 2-bit, coupled with quantization-aware training, to speed up computation and reduce the memory footprint while maintaining accuracy.

Fig. 2. A) our dataset collection methodology, B) our dataset collector GUI, and C) a sample sequence from our collected dataset.

Fig. 3. Our CNN architecture exploration includes: i) three block types (RB, D+P, IRLB); ii) an optional by-pass connection (dashed line); iii) variations on the number of channels based on γ. Output feature map sizes are represented as (Width × Height × Channels).

Fig. 5. A) our U-shaped path for the in-field experiments. B) Top-view 2D representation of the path, highlighting its division into three segments (S1, S2, and S3) and the position of the four obstacles (represented with black rectangles).

Lorenzo Lamberti (Member, IEEE) received his Ph.D. in electronic engineering from the University of Bologna, Italy, in 2024, where he now holds a researcher position. His research encompasses artificial intelligence, miniaturized robotics, ultra-low-power embedded systems, and neuromorphic vision. His work has resulted in over 12 publications in international conferences and journals. Dr. Lamberti was the technical lead of the winning team in the "Nanocopter AI Challenge," hosted at the IMAV'22 International Conference in TU Delft.

Lorenzo Bellone received his B.Sc from the Faculty of Industrial Engineering and M.Sc from the Faculty of Telecommunication Engineering in 2018 and 2021, respectively, at Politecnico di Torino, Turin, Italy. He currently holds a Researcher position at the Autonomous Robotics Research Center with the Technology Innovation Institute, Abu Dhabi, UAE. His research interests focus on Deep Learning for communication networks, UAV-aided networks, and deployment of ML solutions for aerial multi-agent systems.

TABLE I
RELATED WORKS BASED ON CNNS, RUNNING ABOARD NANO-UAVS. THE TASK IS EITHER CLASSIFICATION (C), OR REGRESSION (R).

TABLE III
PULP-DRONET V3 CNN ARCHITECTURES, VARYING ITS BLOCKS AND BY-PASS BRANCHES. THE FIRST ROW MATCHES THE SOA BASELINE ARCHITECTURE [5].

TABLE IV
ACCURACY, RMSE, MACS, AND MEMORY FOOTPRINT OF OUR CNN ARCHITECTURES BY VARYING γ. THE LAST ROW CORRESPONDS TO TINY-PULP-DRONET V3.

TABLE V
... AND MEMORY FOOTPRINT OF OUR QUANTIZED CNNS. THE LAST ROW CORRESPONDS TO TINY-PULP-DRONET V3.

TABLE VI
CNNS' THROUGHPUT WHEN DEPLOYED TO GAP8 AT ITS MAXIMUM PERFORMANCE CONFIGURATION. THE ENERGY PER INFERENCE IS REPORTED FOR TWO CONFIGURATIONS (E_ee, E_mp). THE LAST ROW CORRESPONDS TO TINY-PULP-DRONET V3.