This method comprises five key steps: sensor selection, data collection, image processing and labeling, model training, and testing of the air-based models. Together, these steps produce a final set of model performance metrics capable of answering the research question under investigation.
A. SENSOR SELECTION
The LWIR camera selected for this research is the FLIR (Forward-Looking Infrared) Vue Pro R, a radiometric-capable camera designed specifically for drones that costs $2,914 USD. The camera has a field of view (FOV) of 45° and a lens diameter of 6.8 mm [36]. The 30 Hz variant of the FLIR Vue Pro R is used; although it provides a higher frame rate (30 FPS) than the 9 Hz variant (9 FPS), it is export controlled and cannot be purchased outside of the United States. Both variants produce the same LWIR resolution. The camera resolution is 336 x 256, with a spectral band of 7.5–13.5 µm and an operating temperature range of -20 °C (-4 °F) to 50 °C (122 °F) [36].
The RGB camera selected for this research is the RunCam 5 Orange, which is designed for drone applications and costs $110 USD. The RunCam 5 uses a Sony IMX377 12-megapixel image sensor with a 145° FOV and adjustable resolution, ranging from 1080P at 60 FPS to 4K at 30 FPS [37]. For this research, 1080P (1920 x 1080) at 60 FPS (60 Hz) is used. Shutter speed, ISO, color style, saturation, exposure, contrast, sharpness, and white balance are all set to their default settings.
B. DATA COLLECTION
Overhead imagery for the air-based ML models is collected from the DJI Inspire 2 (Fig. 4). The RGB and LWIR cameras on the multirotor are co-aligned to maintain the same field of view, ensuring that similar images are collected by the two sensors [38]. Data are collected at various times of day and at different temperatures to ensure data diversity. Footage is recorded to each camera's micro-SD card and extracted afterward. Frames of interest are then extracted from the footage and converted into images to train the ML models. Images are also collected from various altitudes to increase image diversity and to help reduce model performance loss at higher altitudes.
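As an illustration of the frame-extraction step, the sketch below pulls frames from recorded footage at a fixed interval using OpenCV. The file names, output folder, and sampling interval are assumptions chosen for illustration; this is not the exact script used in the study.

```python
# Minimal frame-extraction sketch (assumed file names and sampling rate; not the study's exact script).
import cv2
import os

VIDEO_PATH = "flight_rgb_1080p.mp4"   # assumed footage copied from the camera's micro-SD card
OUTPUT_DIR = "extracted_frames"        # assumed output folder for candidate training images
FRAME_STEP = 30                        # keep one frame in every 30 (about one per second at 30 FPS)

os.makedirs(OUTPUT_DIR, exist_ok=True)
cap = cv2.VideoCapture(VIDEO_PATH)

frame_idx, saved = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % FRAME_STEP == 0:
        cv2.imwrite(os.path.join(OUTPUT_DIR, f"frame_{saved:05d}.jpg"), frame)
        saved += 1
    frame_idx += 1

cap.release()
print(f"Saved {saved} frames to {OUTPUT_DIR}")
```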
A 3D-printed mounting bracket was designed and printed to mount the RGB camera directly to the LWIR camera. The bracket reduces parallax and ensures both cameras share the same FOV. This fixed FOV makes fusing the LWIR and RGB footage in Adobe Premiere Pro easier. The file used to print the mounting bracket can be found in the data availability section.
C. IMAGE PROCESSING AND LABELING
IP techniques are then applied to the original images to increase the quantity of images available in the training dataset while simultaneously generating edge-enhanced images to improve model performance [40]. Six image processing and mathematical morphology techniques are applied: flipping, blurring, blurring & flipping, Gaussian Thresholding (GT), Difference of Gaussians (DoG), and Sobel-XY. See Fig. 6 for a visual example of each technique for RGB, LWIR, and fused RGB-LWIR imagery. The blurred and blurred & flipped techniques are especially useful because of video vibration caused by the oscillatory motion of the airframe's propellers [41]. Training on blurred images helps ensure that the model continues to work when frames are blurred by camera movement, target object movement, or both. Although counterintuitive, training ML models with blurred images tends to increase detection rates and confidence levels [42]. All code for generating and exporting the augmented images can be found in the image processing link in the data availability section [43].
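The sketch below illustrates how each of the six augmentation techniques could be produced from a single source frame with OpenCV. The kernel sizes, adaptive-threshold block size, and sigma values are assumptions made for illustration only; the released image-processing code linked in the data availability section is the authoritative implementation.

```python
# Illustrative augmentation sketch (assumed kernel/threshold parameters; not the released pipeline).
import cv2

img = cv2.imread("frame_00001.jpg")                    # an original RGB, LWIR, or fused frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

flipped = cv2.flip(img, 1)                             # 1) horizontal flip
blurred = cv2.GaussianBlur(img, (7, 7), 0)             # 2) Gaussian blur
blur_flip = cv2.flip(blurred, 1)                       # 3) blur & flip

# 4) Gaussian (adaptive) thresholding on the grayscale frame
gt = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                           cv2.THRESH_BINARY, 11, 2)

# 5) Difference of Gaussians: subtract a wide blur from a narrow blur to enhance edges
g1 = cv2.GaussianBlur(gray, (3, 3), 0)
g2 = cv2.GaussianBlur(gray, (9, 9), 0)
dog = cv2.subtract(g1, g2)

# 6) Sobel-XY: combine horizontal and vertical gradients
sx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
sy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
sobel_xy = cv2.convertScaleAbs(cv2.magnitude(sx, sy))

for name, out in [("flip", flipped), ("blur", blurred), ("blur_flip", blur_flip),
                  ("gt", gt), ("dog", dog), ("sobel_xy", sobel_xy)]:
    cv2.imwrite(f"frame_00001_{name}.jpg", out)
```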
After image processing, a total of 5,400 new training images are generated, bringing the dataset to 6,300 images. Of these, 90% (5,670 images) are used for training, 5% (315 images) for validation, and the remaining 5% (315 images) for testing. None of the newly generated images are used for testing, ensuring that testing results are comparable across all ML models. Lastly, all images are labeled using LabelImg, an open-source Python-based image labeling tool [44].
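A minimal sketch of one way to realize this split is shown below. It assumes augmented files can be identified by a suffix in their file names and that the validation set, like the test set, is drawn only from original frames; both are illustrative assumptions rather than details stated in this study.

```python
# Minimal train/val/test split sketch (assumed file-naming convention for augmented images).
import random
from pathlib import Path

AUG_SUFFIXES = ("_flip", "_blur", "_blur_flip", "_gt", "_dog", "_sobel_xy")  # assumed suffixes

all_images = sorted(Path("dataset/images").glob("*.jpg"))
originals = [p for p in all_images if not p.stem.endswith(AUG_SUFFIXES)]
augmented = [p for p in all_images if p.stem.endswith(AUG_SUFFIXES)]

random.seed(0)
random.shuffle(originals)

# 5% of the full dataset for validation and 5% for testing, drawn only from original frames
# (that validation excludes augmented images is an assumption for this sketch).
n_hold = round(0.05 * len(all_images))
val_set = originals[:n_hold]
test_set = originals[n_hold:2 * n_hold]
train_set = originals[2 * n_hold:] + augmented   # remaining originals plus all augmented images

print(len(train_set), len(val_set), len(test_set))
```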
D. MODEL TRAINING
This research utilizes YOLOv7 as the Convolutional Neural Network (CNN) to perform object detection [45]. YOLOv7 was selected because, at the time of writing, it surpasses all existing object detectors in terms of speed and accuracy [46] and is considered one of the fastest open-source object-detection models currently available [47], [48], [46]. A primary shortfall of this family of object detection models is that YOLO approaches can struggle to detect smaller objects within an image, primarily due to spatial constraints in the algorithm [49], [50]. The standard YOLOv7 variant is used for this research study [51].
The model is trained for 55 epochs. This number was selected to prevent overtraining: there is an imbalance between the number of car and truck labels in the dataset (cars have the most labels while trucks have the fewest), making overfitting possible if the models are trained for too many epochs [52]. Training beyond the selected 55 epochs may increase false positives, thereby decreasing the mAP of the model. After training is complete, the three models (RGB, LWIR, and RGB-LWIR) are ready for evaluation on drone-based imagery collected at different periods of the day and at various altitudes.
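For reference, training with the open-source YOLOv7 repository is typically launched through its train.py script. The sketch below shows what such a run could look like; the dataset YAML, batch size, image size, and run name are assumptions, and only the 55-epoch setting and the use of the standard YOLOv7 variant come from this study.

```python
# Sketch of launching YOLOv7 training (WongKinYiu/yolov7 train.py); dataset YAML, batch size,
# image size, and run name are assumptions -- only the 55-epoch setting comes from this study.
import subprocess

subprocess.run([
    "python", "train.py",
    "--weights", "yolov7.pt",                 # standard YOLOv7 variant, pretrained weights
    "--cfg", "cfg/training/yolov7.yaml",
    "--data", "data/vehicles.yaml",           # assumed dataset definition (car/truck classes)
    "--epochs", "55",                         # selected to limit overfitting on the imbalanced labels
    "--batch-size", "16",                     # assumed
    "--img-size", "640", "640",               # assumed
    "--device", "0",
    "--name", "rgb_lwir_fusion",              # assumed run name
], check=True)
```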
E. TESTING AIR-BASED MODELS
A multirotor drone will be utilized to fly at fixed elevations to determine inference performance, via mAP, for both sensors and all three model types. As indicated in Fig. 7, to assess the models and sensors, new test images, separate from those used for training, will be extracted from video footage collected at 15 m (50 ft), 30 m (100 ft), 45 m (150 ft), 61 m (200 ft), 76 m (250 ft), 91 m (300 ft), 106 m (350 ft), and 121 m (400 ft). Footage cannot be collected above 121 m (400 ft) due to Federal Aviation Administration (FAA) regulations that prohibit drones from flying above that altitude. Additionally, data will be collected at five different periods of the day: Pre-Sunrise (low thermal cross-over, low illumination), Post-Sunrise (low thermal cross-over, medium illumination), Solar-Noon (high thermal cross-over, high illumination), Pre-Sunset (high thermal cross-over, medium illumination), and Post-Sunset (high thermal cross-over, low illumination). Atmospheric and location-related metadata will also be recorded prior to each flight to support both this study and the reusability of the images in future research. This metadata includes temperature (°C), wind speed (meters per second), illumination (lux), time, date, and location.
Five test images will be extracted at every elevation for each image type, resulting in 120 images per flight (5 each of RGB, LWIR, and RGB-LWIR across the 8 elevations) and 600 labeled images per daily period (5 flights). Following ten full flights, a total of 1,200 test images will be collected to evaluate model and sensor performance. When calculating mAP for the test images, variables will be constrained to a confidence threshold of 10% and an intersection over union (IoU) threshold of 65%. After executing the test code, the notebook will export critical metrics such as precision, recall, the precision-recall curve, [email protected], and [email protected]:95. For this research, only [email protected] will be used to measure sensor and model performance at fixed elevations. The labeled test image dataset and test script can be found in the Test Data link and the YOLOv7 Training Code notebook link in the data availability section.
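As an illustration of how these evaluation constraints could be applied with the open-source YOLOv7 repository's test.py script, the sketch below passes the 10% confidence and 65% IoU thresholds stated above; the weights path, dataset YAML, image size, and run name are assumptions rather than details of the study's test script.

```python
# Sketch of evaluating a trained model with the YOLOv7 repository's test.py; the weights path,
# dataset YAML, image size, and run name are assumptions -- the 10% confidence and 65% IoU
# thresholds are the constraints stated above.
import subprocess

subprocess.run([
    "python", "test.py",
    "--weights", "runs/train/rgb_lwir_fusion/weights/best.pt",  # assumed path to trained weights
    "--data", "data/vehicles_test.yaml",                        # assumed test-set definition
    "--img-size", "640",
    "--conf-thres", "0.10",                                     # 10% confidence threshold
    "--iou-thres", "0.65",                                      # 65% IoU threshold
    "--task", "test",                                           # evaluate on the held-out test split
    "--name", "altitude_eval",                                  # assumed run name
], check=True)
```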