Autonomous unmanned aerial vehicle flight control using multi-task deep neural network for exploring indoor environments

In recent years, owing to advances in image processing using deep learning, autonomous unmanned aerial vehicle (UAV) navigation based on image recognition has become possible. However, most existing image-based deep learning methods focus on single-task autonomous UAV systems, which cannot perform other required tasks. Meanwhile, deep learning methods based on multi-task learning, which are suitable for multi-tasking autonomous UAV systems, have not been sufficiently researched. Therefore, in this study, we propose a UAV flight control method, based only on monocular camera images, that uses multi-task learning to correct the UAV's self-position and head direction and to recognize and select among multiple movement directions while exploring an unknown indoor environment.


Introduction
Autonomous search and navigation inside buildings by machines is highly anticipated, especially in the early stages of disaster relief and regional reconnaissance missions. It is difficult for unmanned ground vehicles (UGVs) to operate efficiently in some field environments. In such situations, unmanned aerial vehicles (UAVs) have gradually become an excellent alternative for environmental investigation missions. However, developing a safe, reliable, and autonomous UAV navigation system for indoor environments remains a largely unaccomplished challenge.
GPS technology is not reliable for real-time applications in indoor and sometimes even outdoor environments, especially where tall buildings and trees block GPS signals [1]. In such cases, it is necessary to observe the surrounding environment and make decisions regarding autonomous movement using the onboard sensors of the UAV as the primary source of information.
In the last few years, various methods for autonomous navigation using onboard sensors have been implemented. Previous studies have used laser rangefinders (light detection and ranging (LiDAR)), RGB-D sensors, stereo vision, and complex algorithms such as simultaneous localization and mapping (SLAM), which can determine the relative position of a device at any instance [2][3][4]. However, it is difficult to implement these methods on small and lightweight UAVs because of the high computational load required by complex algorithms, and the weight and size of the sensors.
Furthermore, vision-based navigation has attracted attention in aerial robotics owing to its applicability in commercial quadcopters, which are commonly equipped with forward-facing cameras. In particular, the advancement of machine learning (ML) and deep learning (DL) in the field of computer vision has demonstrated the capability of applying vision-based techniques to UAVs [5]. Previous vision-based autonomous UAV research often focused on obstacle detection and avoidance as an essential step towards safe exploration and navigation [6,7]. However, these approaches primarily build an automatic control system that concentrates only on a single task.
Actual UAV operation requires the ability to process several tasks simultaneously, and it is necessary to analyse the situation from the information acquired in each task. Obtaining the maximum possible information with limited sensors and solving numerous tasks is important in robotics research. In particular, predicting the direction of movement and recognizing the UAV's own position from a monocular camera image, in the way a human pilot would, is very difficult.
This paper proposes a vision-based approach that uses multi-task learning combined with a DL method to achieve safe autonomous UAV navigation based on monocular camera images in an indoor environment exploration mission. Additionally, a custom dataset was created to evaluate the proposed method. The main contributions of this study are summarized as follows:
• A multi-task learning architecture that addresses the tasks of predicting possible movement directions and correcting the UAV's current position and moving direction.
• A local motion planning policy for the proposed architecture, deployed on a real drone. Experiments on a commercial drone demonstrated the capability of the proposed method in real-life missions.
• A custom dataset containing images of various corridors captured from varying positions and angles within the building, collected using a drone equipped with a monocular camera.

SLAM-based autonomous UAV navigation
In recent years, several researchers have used a combination of LiDAR and SLAM algorithms to control UAVs in indoor environments. The SLAM algorithm is suitable for creating 3D maps of the surrounding environment using non-visual data, such as radar and LiDAR measurements, or visual data, such as images. From the vast amount of data collected from the surrounding environment through these sensors, a UAV can build a 3D map of the environment and simultaneously estimate its own position. Based on the created map, the current position of the UAV and the next available direction for movement can be calculated. For example, Tulldahl et al. performed experiments to demonstrate the 3D mapping capabilities of a small multirotor UAV using the Velodyne HDL-32E LiDAR [8]. Bachrach et al. generated a 3D map using an RGB-D camera and the SLAM algorithm, which was subsequently used for localization and path planning in an unknown corridor environment [9].
However, this method has several disadvantages. First, the accuracy of the created map depends on sensor quality. For example, depth cameras are greatly affected by light conditions (e.g. ambient light) and often have a high noise level. Their capabilities are also limited for highly reflective or transparent surfaces in the environment, and they have a short range (3-5 m). Therefore, SLAM methods that use depth cameras (also called visual SLAM) may be greatly affected by this drawback and have a poor reconstruction quality. Second, equipment specifications and UAV modifications must be considered. SLAM systems require sensors (e.g. LiDARs, RGB-D and monocular cameras) and processors (e.g. Intel, AMD) with compatible operating systems/software. In addition to the sensors mentioned above, creating a detailed map for navigation with SLAM requires additional metric sensors. These complex settings are unsuitable for a lightweight UAV and may require a larger UAV, which might have difficulty in travelling through buildings in disaster areas. Finally, 3D map regeneration is extremely complex and requires significantly high computational cost and power consumption. The time required for reconstructing the surroundings is rather long, and memory consumption increases over time. Therefore, this may not be suitable for missions wherein the environment must be explored and the objectives located as quickly as possible.
Therefore, although using SLAM in the exploration system may provide more accurate information and enable safer travel, these drawbacks make SLAM techniques less than ideal for autonomous exploration in indoor environments.

Deep learning
To address these problems, we primarily focus on visual-only approaches. The implementation of DL, which has obtained satisfactory results in the field of image processing, has demonstrated promising results in some studies [6,7,10-12]. Most recent studies divide these approaches into two types: a trial-and-error learning strategy, known as reinforcement learning (RL) [13], and a supervised learning method that enables the development of end-to-end learning strategies.
RL approaches often focus on correlating raw camera inputs with the UAV's control command and combining them with the RL algorithm to facilitate model learning through demonstration. However, RL models usually require a considerable amount of experience; therefore, a lack of training conditions limits the model's capabilities, which raises safety concerns regarding correct UAV control and crash handling in real-world environments. Consequently, RL control policies are usually learned in a simulator (AirSim, Gazebo ROS) [14,15], but the gap between simulation and reality still exists, making it difficult to deploy these policies in the physical world.
The second is supervised learning, an end-to-end learning strategy wherein features are extracted and a large set of learnable parameters is learned from examples with correct answers. In UAVs, this is called imitation learning, wherein UAV experts control the UAV in a real-world environment and input images/pilots' choices of action are collected according to the situation. The collected pilot selections are used as ground truth labels for images in the training of ML/DL models, thereby allowing models to mimic human behaviour in different situations. Previous work by Loquercio et al. [7] demonstrated that this approach can be used in cities by training DL models on data collected through cars and bicycles in an urban environment. Smolyanskiy et al. [11] and Giusti et al. [12] developed systems for training DL models from videos collected using GoPro cameras, and they succeeded in flying an autonomous UAV that can follow forest paths.

Multi-task learning
Most research using DL-based methods focuses on solving only one problem, such as obstacle avoidance, non-collision direction searching, or determining a single movement direction. However, the capability to solve several tasks using one image and control them correctly is considered very useful in lightweight UAVs, wherein the size and weight of sensors that can be mounted are limited.
In recent years, several applications of multi-task learning, an ML method that solves multiple problems using a single model, have applied this idea to the autonomous navigation of UAVs [16,17]. However, the development of a highly reliable model remains challenging. Because GPS cannot be used in an indoor environment, it is necessary to develop a system that allows UAVs to move using only image information. Therefore, in this study, we propose a model that simultaneously solves three tasks in a UAV's exploration mission: predicting multiple movable directions, determining the lateral position offset, and controlling the head direction using multi-task learning. The proposed model is expected to contribute to the safe operation of UAVs in indoor environments.

Problem formulation
The approach in this study is primarily based on the previous work of Viet et al. [18,19], which proposed the idea of combining the prediction of multiple directions task (situation prediction task) with the determination of the lateral position offset (position prediction task). It resulted in a multi-task learning-based model that can simultaneously execute two tasks. In [19], a dataset for the two-task learning model was constructed by collecting images captured using a camera from various positions in a corridor. For each collected image, the labels for both tasks were annotated from the general perspective of the pilot and later used as supervised learning data for the two-task learning model.
The results of a real-world experiment demonstrated the promising capability of developing a UAV system that can navigate safely in an unseen indoor environment using only images.
However, this approach contains the following unsolved problems that heavily affect exploration missions:
Unsolved problem 1: When the UAV's motors are not in an ideal condition, the "Moving Forward" command may not be executed correctly, and the UAV might turn diagonally. This causes the UAV to crash into walls if there is no appropriate policy to manage its current head direction (Figure 1).
Unsolved problem 2: When the UAV approaches a crossroads that requires it to turn left or right, the angle of entry is not always 90° (Figure 2). The problem in this situation is how the UAV checks its direction and the rotation required to safely enter the new path.
Unsolved problem 3: The existing methods in [18] and [19] did not introduce any control algorithm that could use the proposed model. Specific control algorithms for these tasks must be designed, and experiments on a real UAV should be performed.
Generally, in unsolved problem 1, UAV pilots can immediately identify that they are heading toward the wall by looking at the camera image and can then correct the UAV's direction. In unsolved problem 2, they can keep rotating the UAV until it is facing the new path. Therefore, if the UAV can imitate human recognition in estimating its current situation or adjusting its direction and position using only images, safe autonomous exploration in indoor environments can be achieved using a simple monocular camera. The most efficient method is to add a new task to address these problems. However, adding a new task may require a specific model for handling it, resulting in large memory consumption during system processing. Therefore, in this paper, we propose a multi-task learning-based model that can process all three tasks simultaneously, which is an extension of the model proposed by Viet et al. [19]. The three tasks are as follows:

Situation prediction task (eight-class classification)
This task estimates the available movable directions in the current image. Each image is classified into one of eight classes: Move Forward; Move Forward or Turn Left; Move Forward or Turn Right; Turn Left or Right; Turn Left; Turn Right; Dead-end; and Wall.

Position prediction task (three-class classification)

This task determines the UAV's lateral position offset in the corridor, as in [19], by classifying each image as Centre, Left, or Right position.

Direction prediction task (three-class classification)
This task recognizes where the UAV is currently heading in the corridor so that the head direction can be corrected, classifying each image as (a) Centre View, (b) Left View, or (c) Right View (Figure 5).
By correcting the position and direction simultaneously, our proposed model can control the UAV more safely in the corridor, compared with that proposed by Viet et al. [19]. Moreover, the direction prediction task is expected to be effective for unsolved problem 2 ( Figure 6); when the UAV turns left or right to enter a new path (Figure 6 A), it can rotate until the foreground is confirmed to be in the Centre View class (Figure 6 B).
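For reference, the class labels used by the three classification heads can be summarized in a small Python sketch; the class names follow the text above, but the exact strings used in the implementation are assumptions:

```python
# Illustrative label sets for the three classification heads.
SITUATION_CLASSES = [
    "Move Forward",
    "Move Forward or Turn Left",
    "Move Forward or Turn Right",
    "Turn Left or Right",
    "Turn Left",
    "Turn Right",
    "Dead-end",
    "Wall",
]                                                   # eight-class situation prediction
POSITION_CLASSES = ["Centre", "Left", "Right"]      # lateral position offset
DIRECTION_CLASSES = ["Centre View", "Left View", "Right View"]  # head direction
```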

Data collection and preprocessing
In this study, we used the dataset created by Viet et al. [19], which was recorded with the DJI Tello Drone's front camera and provides images with situation and position prediction task labels. Because the camera was placed on the corridor's centreline and faced forward while recording, the direction labels for all images in this dataset were considered Centre View. To collect data for the proposed direction prediction task, we flew the DJI Tello Drone in various corridors and recorded the rotation angle for each image. The direction in each image was then classified (Centre View, Left View, or Right View) according to the procedure illustrated in Figure 7: if the angle deviation from the front exceeds 10°, the UAV's head direction is labelled (b) Left View or (c) Right View; otherwise it is labelled (a) Centre View.
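A minimal sketch of this labelling rule is given below; the sign convention of the recorded angle (positive meaning the head is rotated to the left) is an assumption, not stated in the paper:

```python
DIRECTION_THRESHOLD_DEG = 10.0   # angle-deviation threshold described above

def direction_label(angle_deg: float) -> str:
    """Map a recorded rotation angle to a head-direction label.
    Sign convention (an assumption): positive = rotated to the left."""
    if angle_deg > DIRECTION_THRESHOLD_DEG:
        return "Left View"
    if angle_deg < -DIRECTION_THRESHOLD_DEG:
        return "Right View"
    return "Centre View"
```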
Additionally, when the UAV is facing a wall, it may not be able to estimate the available movable paths before correcting its head direction; thus, images obtained when the UAV faces a wall are labelled "Wall" for the situation prediction task. As a result, our custom dataset includes images annotated with three labels corresponding to the three proposed tasks. The distribution of classes for each task is shown in Figures 8-10.

Proposed model
Previous studies on autonomous UAV control have commonly used convolutional neural networks (CNNs) with the transfer learning technique as the primary approach for developing vision-based autonomous UAVs. Although the CNN architecture is very powerful for image classification, it often contains millions of parameters, which makes parameter tuning problematic; thus, learning from only a few training images is difficult. With transfer learning, the hidden layers of a pre-trained CNN that has achieved high performance on other tasks can be used as a feature extractor and reused on the target task. As a result, the model can improve its abilities even with little data in the target task by taking advantage of the abilities learned in the source task. In accordance with the study in [19], our proposed model also used the hidden layers of a pre-trained VGG16 model [20], trained on the ImageNet dataset, for feature extraction, after which we added dense layers for the three-task learning, divided into separate branches corresponding to each task (Figure 11).
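A possible tf.keras sketch of such an architecture is shown below. The width of the dense layers in each branch and the decision to freeze the backbone are assumptions; the paper only specifies a pre-trained VGG16 feature extractor followed by task-specific dense branches:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_three_task_model(input_shape=(224, 224, 3)):
    """Shared VGG16 backbone with three task-specific dense branches."""
    backbone = tf.keras.applications.VGG16(
        weights="imagenet", include_top=False, input_shape=input_shape
    )
    backbone.trainable = False           # reuse ImageNet features as a fixed extractor

    inputs = layers.Input(shape=input_shape)
    x = backbone(inputs, training=False)
    x = layers.Flatten()(x)

    def branch(name, n_classes):
        # One dense branch per task; the layer width is an assumption.
        h = layers.Dense(256, activation="relu")(x)
        h = layers.Dropout(0.5)(h)       # initial dropout rate reported in the paper
        return layers.Dense(n_classes, activation="softmax", name=name)(h)

    outputs = [
        branch("situation", 8),          # eight-class situation prediction
        branch("position", 3),           # lateral position: Centre / Left / Right
        branch("direction", 3),          # head direction: Centre / Left / Right View
    ]
    return models.Model(inputs=inputs, outputs=outputs)
```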

Proposed control algorithm
For unsolved problem 3, a control algorithm must be created to process the results from the proposed model and perform all three tasks successfully on an actual UAV. Although other similar studies [21] using multi-task learning have proposed a simple algorithm for controlling UAVs, they did not include the situation prediction task in their models; thus, a new algorithm specific to the proposed model is required. In developing a control algorithm using the proposed model, the following aspects must be considered:
• Even if the situation prediction task accuracy is high for the test data, the model can still make mistakes in a real-world environment.
• If the predicted situation is incorrect, policies are required to deal with the resulting problems and safely navigate the UAV.
• The control strategies must perform all three tasks simultaneously to operate the UAV smoothly.
Considering these problems, we proposed a control algorithm with four states that operates the UAV based on the results of our proposed multi-task learning model. The relationship between the states is shown in Figure 12.
The UAV system is initialized at state 0 (initial state). After the launch, the system switches to state 1.
In state 1 (fixing direction), the head direction is adjusted (Figure 13). First, the model predicts the current direction of the UAV from the input image. If the current direction is Centre View, the system switches to state 2. Otherwise, if the current direction is Left or Right View, the UAV rotates until the predicted direction becomes Centre View; when this operation is complete, the system switches to state 2.
In state 2 (standby/predict), the position and situation prediction tasks are performed simultaneously (Figure 14). First, in the position prediction task (right part of Figure 14), if the current position is predicted as Centre, the UAV maintains that position. However, if the current position is not Centre, the UAV shifts to the left or right side.
The left side of Figure 14 shows the situation prediction task. First, our model predicts the current situation from the input images, and the number of processed images is counted. As discussed above, even if the accuracy of the situation prediction is high, each prediction is obtained using only one image; thus, its reliability is limited. Therefore, we use the results of multiple predictions to select the final outcome. To determine the optimal result, three conditions are examined: multiple predictions, average probability, and number of predictions (multiple trials):

Condition 1 (Multiple predictions)
The camera was set to capture 20 images every second. In the experiments, we considered the final outcome of the situation prediction task after 1 s; thus, the most common situation among the sequence of 20 predictions is selected as the current situation.

Condition 2 (Average probability)
We calculate the average probability for the situation that received the most votes in Condition 1, using the probabilities output by the softmax function. From preliminary experiments, we set 0.7 as the confidence threshold for the model's final prediction: if the average probability is 0.7 or more, the predicted situation is considered reliable.

Condition 3 (Multiple trials)
When the average probability is less than 0.7, the result of the situation prediction is considered untrustworthy and must be predicted again. Viet et al. [18] mentioned that as a UAV approaches a situation, the input images may provide more information about it; thus, prediction accuracy may improve. In other words, moving the UAV slightly forward and making predictions again may increase the probability of obtaining a better result. Therefore, we keep moving the UAV forward for 1 s and repeat the above process. If three prediction attempts have been made, the system chooses the final predicted situation and switches to state 3.
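The three conditions can be summarized in the following hedged sketch. The callbacks `predict_batch`, `capture_second`, and `move_forward_1s` are hypothetical stand-ins for model inference, frame capture, and a one-second forward move, and averaging the winning class's probability over the frames that voted for it is one plausible reading of Condition 2:

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.7   # confidence threshold reported in the paper
MAX_TRIALS = 3               # Condition 3: at most three prediction attempts
IMAGES_PER_SECOND = 20       # Condition 1: 20 frames captured per second

def decide_situation(predict_batch, capture_second, move_forward_1s):
    """Sketch of Conditions 1-3; the three callbacks are assumptions."""
    for trial in range(1, MAX_TRIALS + 1):
        frames = capture_second(IMAGES_PER_SECOND)               # Condition 1
        probs = predict_batch(frames)                            # (20, 8) softmax outputs
        votes = probs.argmax(axis=1)
        candidate = int(np.bincount(votes).argmax())             # most frequent class
        avg_prob = probs[votes == candidate, candidate].mean()   # Condition 2
        if avg_prob >= CONFIDENCE_THRESHOLD or trial == MAX_TRIALS:
            return candidate                                     # accept the result
        move_forward_1s()                                        # Condition 3: retry closer
```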
In state 3 (moving), the command corresponding to the situation determined in state 2 is executed (Figure 15):
(1) In the case of "Move Forward": the UAV moves forward by approximately 3 m.
(2) In the cases of "Move Forward or Turn Left," "Move Forward or Turn Right," "Turn Left or Right," "Turn Left," or "Turn Right": either a "Move forward about 3 m" or a "Move forward about 1 m and then rotate" command is executed, depending on the system's chosen movement. In the rotation phase, the UAV rotates until its direction is predicted as Centre View.
(3) In the case of "Dead-end": the UAV rotates 180° and returns.
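The four-state logic can be sketched as follows; the drone-command helpers and prediction callbacks are assumed wrappers around the UAV SDK and the proposed model, not the paper's actual code:

```python
def control_loop(drone, predict_direction, predict_position, decide_situation):
    """Hedged sketch of the four-state controller described above."""
    state = 1                                    # state 0 (initial) ends after take-off
    while True:
        if state == 1:                           # state 1: fixing direction
            while predict_direction() != "Centre View":
                drone.rotate_step()              # small yaw step toward Centre View
            state = 2
        elif state == 2:                         # state 2: standby / predict
            if predict_position() != "Centre":
                drone.shift_sideways()           # move back to the corridor centreline
            situation = decide_situation()       # Conditions 1-3 (previous sketch)
            state = 3
        elif state == 3:                         # state 3: moving
            if situation == "Dead-end":
                drone.rotate_180()               # turn around and return
            elif situation == "Move Forward":
                drone.move_forward_m(3)          # advance approximately 3 m
            else:                                # junction classes
                drone.move_forward_m(1)          # advance about 1 m, then rotate
                while predict_direction() != "Centre View":
                    drone.rotate_step()          # rotate into the new path
            state = 1                            # re-fix direction before the next leg
```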

Network training
The goal of this experiment was to verify our proposed model's performance on all three tasks. We split the dataset into three parts for the training phase: 50% for training, 25% for validation, and 25% for testing the model's performance. Before the experiment, we applied data augmentation techniques to the original data to handle the imbalance in our dataset: random zoom, random rotation, random width shifting, random height shifting, and random shear transformation. Additionally, the input images were resized to match the VGG16 model's original input layer. The model optimizer and parameters were determined after some trials, and stochastic gradient descent (SGD) was chosen as the optimizer. The initial learning rate was set to 0.001, and the initial dropout rate was 50%. ImageNet pre-trained weights were applied to the feature extractor, and training was executed over 20 epochs with a mini-batch size of 32. We used the overall accuracy and loss for each task as metrics to evaluate the model's learning performance during training, and a confusion matrix on the test data to analyse the final performance of the model for each task. Confusion matrices were computed from the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
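A minimal sketch of this training configuration in tf.keras is shown below; the augmentation magnitudes are assumptions (the paper does not report them), and `build_three_task_model` refers to the architecture sketch above:

```python
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation matching the techniques listed above; magnitudes are assumptions.
augmenter = ImageDataGenerator(
    zoom_range=0.2,
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.1,
)

model = build_three_task_model()                  # architecture sketch from above
model.compile(
    optimizer=SGD(learning_rate=0.001),           # initial learning rate from the paper
    loss={
        "situation": "categorical_crossentropy",
        "position": "categorical_crossentropy",
        "direction": "categorical_crossentropy",
    },
    metrics=["accuracy"],
)
# Training used 20 epochs, a mini-batch size of 32, and a 50/25/25
# train/validation/test split, as described in the text.
```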

Real-world experiments
To evaluate the robustness of the proposed model and control algorithm in real-world environments, we performed experiments using the DJI Tello Drone at a location that was not included in our dataset. The model ran on a host machine with an Intel processor, 32 GB of RAM, and an NVIDIA GeForce RTX 2060 GPU, running Windows 10. The DJI Tello Drone was connected to the host machine via Wi-Fi, and all processing was executed on the host machine.
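A minimal sketch of the host-side setup is given below, assuming the open-source djitellopy wrapper for the Tello SDK; the paper only states that the drone was controlled over Wi-Fi and does not name the library used:

```python
import cv2
from djitellopy import Tello   # assumed SDK wrapper; the paper does not name the library

tello = Tello()
tello.connect()                # connect to the drone's Wi-Fi access point
tello.streamon()               # start the forward-camera video stream
tello.takeoff()

frame_reader = tello.get_frame_read()
frame = cv2.resize(frame_reader.frame, (224, 224))   # match the VGG16 input size
# ...run the three-task model on `frame` on the host machine, then send commands, e.g.:
tello.move_forward(100)        # distance in centimetres
tello.rotate_clockwise(90)     # angle in degrees
tello.land()
```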

Experiment 1
The purpose of this experiment was to confirm the overall performance of our proposed model and control algorithm with respect to the unsolved problems described in the Problem Formulation section. The experiment was conducted under the conditions shown in Figure 16. The evaluation criteria for the proposed method were set as follows:
(1) For each phase (states 1 to 3), the final predictions of all tasks are accurate: the proposed model can recognize the wall and change the UAV's direction, guide the UAV through the corridor, and recognize the situation "Move Forward or Turn Right" successfully.
(2) The UAV can move safely based on the combined results of multiple predictions and average probabilities.
(3) Each state switch is performed successfully.
(4) Each operation (Move Forward, Turn Right) is performed satisfactorily.

Experiment 2
The purpose of this experiment was to evaluate the effectiveness of the proposed algorithm when used with the proposed model, and in particular the contribution of multiple predictions, average probabilities, and multiple trials to overall performance. We carefully investigated previous studies and found no related work applying the situation prediction task other than [18] and [19]; therefore, we added simple control algorithms to these models and used them as baselines for this experiment:
• One-task, Single Prediction, Single Trial: this algorithm controls the UAV based on the results of the situation prediction task in Viet et al. [18] (Figure 17).
• Two-task, Single Prediction, Single Trial: this algorithm controls the UAV based on the results of the situation and position prediction tasks in Viet et al. [19] (Figure 18).
These baseline algorithms do not apply the combination of multiple predictions, average probabilities, and multiple trials used in the proposed algorithm.
Figure 17. Algorithm for one-task, single prediction, single trial.
Figure 18. Algorithm for two-task, single prediction, single trial.
The experiment was conducted under the following conditions:
(1) The initial position of the UAV was on the centreline of the corridor, facing the centre of the path.
(2) Approximately 10 m ahead, there is a junction where the UAV can "Move Forward" or "Turn Right."
(3) When the UAV approaches this junction and successfully recognizes the situation (Move Forward or Turn Right), the command "Turn Right" is set to execute.
The experiment was performed 10 times for each algorithm. An experiment was considered successful if it satisfied the following conditions: (1) The final predictions of all the tasks were accurate.
The proposed model can guide the UAV through the corridor and recognize the situation "Move Forward or Turn Right" successfully.

Training result
Training results are shown in Figures 19-21.

Performance in real-world environment
During the experiment, the processed information was displayed on the screen of the host machine ( Figure 22).

Experiment 1: Qualitative Results of the Proposed Method
In Experiment 1, each state switch was performed successfully. When facing the wall (Figure 23 A), the proposed model accurately performed the direction prediction task and navigated the UAV to fly in the correct direction (Figure 23 B). The situation prediction task was performed by selecting the best prediction from 20 images (in Figure 24 A, the final situation result was Moving Forward) and successfully made the right decisions to guide the UAV to the following situation (in Figure 24 B, the final situation result was Moving Forward or Turning Right).
When performing the "Turning Right" command (Figure 25 A) the direction prediction task correctly predicted the next path and stopped the UAV when it was facing the correct direction (Figure 25 B).
Moreover, the position predicting task kept the UAV on the centreline of the corridor throughout our experiment.

Experiment 2: Quantitative Results of Proposed Method
The results of Experiment 2 are listed in Table 1.

Training experiments
In the training experiments on the three tasks, we used a pre-trained CNN model's backbone as a feature extractor and added three branches for each task classification. As seen in Figures 20 and 21, the model achieved satisfactory performance in the position prediction and direction prediction tasks (0.91 or more for each class). Humans can easily complete these two tasks using images in real-life situations because the wall's edges and corridor's path clearly indicate the direction. This explains our model's results for these two tasks because CNN models often excel at detecting edges and shapes in images. Previous studies that combined these two tasks [7,8] also mentioned CNN's power in predicting the head direction and lateral offset. In contrast, in the situation prediction task (Figure 19), the proposed model could not achieve such good accuracy. Evidently, the performance of our proposed model decreased in comparison with the situation prediction results in [19] (Figure 26).
In particular, the prediction accuracy for four classes (Moving Forward, Moving Forward or Turning Left, Turning Left or Turning Right, and Turning Right) decreased slightly, and that of Turning Left or Right fell below 0.6. We assume that learning these three tasks with only one feature extractor has some limitations, because it is difficult to generalize the features of all three tasks. Additionally, our study added the Wall class as the eighth class in the situation prediction task, which may have affected the prediction accuracy of the other classes, because the differences in image details may have hindered the model's ability to learn discriminative features. Wall included images wherein the UAV was moving towards a wall with only a few clearly recognizable edges (Figure 27 A), whereas the other classes often included images that contained both sidewalls and ceiling edges (Figure 27 B, C).
Moreover, [19] mentioned that the lack of training data and differences in position predictions might also be contributing factors. Convolutional neural networks often require large amounts of well-balanced training data to ensure prediction accuracy, because the model tends to erroneously adjust to noisy distributions in a highly imbalanced dataset. From the class distributions in Figures 8-10 and the decrease in the situation prediction task's test accuracy, we infer signs of overfitting caused by imbalanced data, which resulted in poor accuracy for some classes. Additionally, differences in the predicted position may also increase the likelihood of errors in situation prediction, because when the UAV flies near a wall (Left/Right position), part of the field of view in the captured image is blocked by the wall, which makes correct prediction more uncertain (Figure 28).

Real-world experiments
Although the proposed model had some difficulties in the situation prediction task, the test data were evaluated using only one image at a time. As discussed in Section 3, multiple images can be obtained during an actual UAV flight, so the results from multiple images can be used. We observed that the proposed model, combined with the proposed control algorithm, successfully navigated the UAV safely through the corridor, made correct situation predictions, and successfully performed the new direction prediction task. Hence, we conclude that the proposed method solves the problems stated in the Problem Formulation section.
Moreover, from the results of Experiment 2 listed in Table 1, it is evident that the proposed method outperformed previous related methods, had no collisions, and made fewer incorrect predictions. There was only one instance wherein the proposed method made a wrong prediction in the first trial, but the average probability was under 0.7 (Figure 29 A); therefore, a second trial was conducted by moving the UAV slightly forward, and in the end, the situation was predicted correctly (Figure 29 B). Therefore, this flight was still considered successful. However, the frames per second (FPS) value shown on the screen was quite low (approximately 5 in this experiment) for a real-time mission, which raises problems for operating a fast and stable UAV system. Because all the data in our experiment were transferred through the DJI Tello Drone's Wi-Fi connection, a transmission bottleneck may have dramatically slowed the entire process. Additionally, the original VGG16 model is a large model with millions of parameters, which may not be suitable for real-time tasks.
In summary, our proposed system, which is a combination of a three-task learning CNN model and a custom control algorithm, performed well in navigating the UAV safely through the corridor and exploring the environment. Furthermore, this method can solve more tasks than other methods (Table 2) and is expected to produce an autonomous UAV system using only images for exploration and rescue missions.

Future works
First, future studies may consider using lighter CNN models, such as ResNet or MobileNet, to decrease the prediction time in actual UAV flights. Moreover, instead of using Wi-Fi to control the UAV, an embedded system that contains GPUs, such as the NVIDIA Jetson Nano, can be mounted onto the UAV to accelerate the process and allow the UAV to completely fly by itself.
Second, as discussed above, the field of view (FOV) was partly covered when the UAV approached the walls, which reduced its performance in predicting situations. This was also mentioned in [19]; the smaller the FOV, the less information is received. Therefore, cameras with a large FOV (fish-eye or omnidirectional cameras) may be employed to obtain better information about the surrounding environment. However, with a completely different camera and FOV, this method may require a new data-labelling technique and a considerable amount of time to collect data. Another solution is pre-training in a simulation environment and using the transfer learning technique to transfer knowledge across different domains.
Finally, we will also consider adding more tasks to the system (e.g. altitude adjustment, human recognition, and environment mapping), improving the model's generalization, and applying a path-planning algorithm to enhance the exploration ability of UAVs.

Conclusion
In this study, we implemented a multi-task learning strategy combined with the power of a CNN for indoor exploration missions of UAVs, to solve the problem of navigating the UAV safely and exploring the environment using only images captured with a monocular camera. We proposed a three-task learning model using a CNN and a control algorithm based on the proposed model, and created a custom dataset for training. Three experiments were conducted to evaluate the proposed method: a network training experiment, which verified the proposed model's performance on three-task learning, and two real-world experiments, which evaluated the robustness of the proposed control algorithm in real-world environments. Our proposed model demonstrated its ability to learn position and direction prediction tasks during the training, but showed some performance reductions in situation prediction tasks. However, using the proposed model's predictions in real-world experiments, our system showed its capability to make the correct decisions and navigate the UAV safely through the corridor, which is remarkable for an image-based UAV system.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Notes on contributors
Viet Duc Bui received his MS degree from the Department of Computer Science, National Defense Academy of Japan in 2021. He is currently a doctoral student at the Department of Computer Science, National Defense Academy of Japan. His research is related to different applications of computer vision in aerial robotics. His interests include computer vision, machine learning, deep neural networks, and aerial robotics.
Tomohiro Shirakawa received his BS and MS degrees from Osaka University, Japan, in 2002 and 2004, respectively, and a PhD degree from Kobe University, Japan, in 2007. From 2007, he worked in the Tokyo Institute of Technology for 3 years as a postdoctoral fellow of the Japan Society for the Promotion of Science. From 2010 to 2021, he was with the National Defense Academy of Japan and served as a research associate and a lecturer. Since 2021, he has been an associate professor at the Department of Information and Management Systems Engineering, Graduate School of Engineering, Nagaoka University of Technology, Nagaoka, Japan. He is engaged in