A total of 717,311 images were recorded in the experimental period, monitoring honeybees and other insects visiting three different plant species.
4.1. Training and Validation
First, a trained model [56] was used to find insects in recordings from 10 different weeks and camera sites, as listed in Table 1. These predictions generated a large number of images of candidate insects, which were subsequently verified. Images with predictions were manually corrected for false positives, resulting in several images with corrected insect annotations and background images without insects. During quality checks, non-detected insects (false negatives) were found, annotated, and added to the dataset.
This dataset was used to create the final training dataset, with approximately 20% of the annotations held out for validation. The training and validation dataset was manually corrected a second time based on the motion-enhanced images, and additional corrections were made. An additional 253 insects were found, an 8% increase in annotated insects compared with the first manually corrected dataset. Two versions of the dataset were created, with color and motion-enhanced images, respectively. The resulting final datasets for training and validation are listed in Table 3.
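As a minimal illustration of such a hold-out split (the 20% fraction is from the text; the shuffling, seed, and function name below are our own assumptions, not the exact procedure used):

```python
import random

def split_annotations(image_ids, val_fraction=0.2, seed=42):
    """Shuffle annotated images and hold out a validation fraction.

    A fixed seed keeps the split reproducible across the five
    training repetitions of each model/dataset combination.
    """
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]  # (training, validation)

# Example: 1000 annotated images -> roughly 800 train / 200 validation
train_ids, val_ids = split_annotations(range(1000))
```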
The training and validation datasets were used to train the two different object detection methods: Faster R-CNN with ResNet50 and YOLOv5. The models were trained with color and motion-enhanced datasets as listed below:
- Faster R-CNN trained on color images;
- Faster R-CNN trained on motion-enhanced images;
- YOLOv5 trained on color images;
- YOLOv5 trained on motion-enhanced images.
Each model and dataset combination was trained five times. The highest validation F1-score was used to select the best five models without overfitting the network. For each of the five trained models, the precision, recall, F1-score, and average precision (AP@0.5) were calculated on the validation dataset. AP@0.5 is calculated as the area under the precision–recall curve for a single class (insects) at an intersection-over-union (IoU) threshold of 0.5. The averages of the five trained models are listed in Table 4.
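For reference, these metrics can be sketched as follows, assuming detections have already been matched to annotations at IoU ≥ 0.5. The trapezoidal integration below is a simplification of the interpolated precision–recall integration used in common detection benchmarks, and all function names are illustrative:

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from detection counts.

    A detection counts as a true positive (TP) when it matches an
    annotated insect with IoU >= 0.5.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def average_precision(confidences, is_tp, n_ground_truth):
    """AP@0.5: area under the precision-recall curve for one class.

    `is_tp[i]` marks whether detection i matched an annotation at
    IoU >= 0.5; detections are ranked by confidence before the
    cumulative precision and recall are computed.
    """
    order = np.argsort(-np.asarray(confidences, dtype=float))
    hits = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(hits)
    fp_cum = np.cumsum(1.0 - hits)
    recall = tp_cum / n_ground_truth
    precision = tp_cum / (tp_cum + fp_cum)
    # Trapezoidal area under the precision-recall curve.
    return float(np.trapz(precision, recall))
```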
The results show a high recall, precision, and F1-score for all models, in the range of 85% to 92%. The models trained with motion-enhanced images have a recall 1–2 percentage points higher than with color images, but the precision is 4–5 percentage points lower. The trained YOLOv5 models have an approximately 1 percentage point higher F1-score and a 2 percentage point higher AP@0.5 than Faster R-CNN. Based on these results, training with motion-enhanced images does not improve the F1-score on the validation dataset.
Training the YOLOv5 and Faster R-CNN models took 18–24 h on an Intel i7 CPU at 3.6 GHz with an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. Processing all camera images from the two months, shown in Figure 4, took approximately 48 h.
4.2. Test Results and Discussion
The test dataset was created from seven different sites and weeks not included in the training and validation datasets. A separate YOLOv5 model was trained on the training and validation dataset described in Section 4.1. This model performed inference on the selected seven sites and weeks of recordings. The results were manually evaluated, removing false predictions and searching for non-detected insects in more than 100,000 images. In total, 5737 insects were found and annotated in this first part of the iterative semi-automated process.
In the second part, two additional object detection models, Faster R-CNN and YOLOv5, were trained with motion-enhanced images. These two models performed inference on the seven sites, and their predictions were compared with the annotated images from the first part, resulting in the finding of an additional 619 insects. The complete test dataset is listed in Table 2.
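The comparison of model predictions against the first-pass annotations can be sketched as an IoU-based matching step; the box format, threshold, and helper names below are assumptions for illustration, not the exact implementation used:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def unmatched_predictions(predictions, annotations, iou_threshold=0.5):
    """Return detections that overlap no existing annotation.

    These are the candidates for manual review: either false
    positives or insects missed in the first annotation pass.
    """
    return [p for p in predictions
            if all(iou(p, a) < iou_threshold for a in annotations)]
```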
The test dataset contains sites with varying numbers of insects, ranging from a ratio of 1.2% to 15.3% of insects relative to the number of recorded images. An average ratio of 6.2% insects was found in 102,649 images. Most of the annotated insects were honeybees, but a small number of hoverflies were found at camera site S1-1. The monitoring site S1-0 (sea rocket) contained other animals, such as spiders, beetles, and butterflies. Many of the images at site S1-1 were out of focus, caused by a very short camera distance to the red clover plants. Sites S2-0 and S2-1 monitored common mallow, which was not part of the training and validation dataset. Site S4-0 had a longer camera distance to the red clover plants, where many honeybees were only barely visible. In general, many insects were only partly visible due to occlusion by leaves or flowers, where only the head or abdomen of the honeybee could be seen. Additional illustrations of insect annotations and detections are included in the Supplementary Materials.
In Table 5, the recall, precision, and F1-score are shown, calculated as an average of the five trained Faster R-CNN models evaluated on the seven test sites. The Faster R-CNN models were evaluated on color and motion-enhanced images. The recall, precision, and F1-score increased for all seven test sites with the Faster R-CNN models trained on motion-enhanced images. The micro-average recall increased by 15 percentage points and the precision by nearly 40 percentage points, indicating that our proposed method has a substantial impact on detecting small insects. This was further verified on a test dataset whose marginal distribution differs from that of the training and validation dataset. The F1-score increased by 24 percentage points, from 0.320 to 0.555. The most difficult test site for the models to predict was S1-0, which had a low ratio of insects (1.2%) and contained animals such as spiders and beetles that were not present in the training dataset.
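Micro-averaging pools the raw detection counts over all sites before computing the ratios, so sites with many annotated insects contribute proportionally more than sparse sites. A minimal sketch, with assumed per-site (TP, FP, FN) tuples:

```python
def micro_average(site_counts):
    """Micro-averaged recall, precision, and F1 over multiple test sites.

    `site_counts` is an iterable of (tp, fp, fn) tuples, one per site.
    Counts are summed before the ratios are computed, so sites with
    many annotated insects weigh more than sites with few.
    """
    tp = sum(c[0] for c in site_counts)
    fp = sum(c[1] for c in site_counts)
    fn = sum(c[2] for c in site_counts)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```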
In Table 6, the recall, precision, and F1-score are shown, calculated as an average of the five trained YOLOv5 models evaluated on the seven test sites. The YOLOv5 models were evaluated on color images and motion-enhanced images. The micro-average recall increased by 28.2 percentage points, and the precision by only 7 percentage points. However, the micro-average F1-score increased by 22 percentage points, from 0.490 to 0.713, indicating that motion-enhanced images did increase the ability to detect insects in the test dataset. The YOLOv5 models outperformed the trained Faster R-CNN models, with the micro-average F1-score increasing by 16 percentage points, from 0.555 to 0.713.
Note that camera sites S2-0 and S2-1 (with common mallow), which were not included in the training set, performed very well with motion-enhanced images, achieving F1-scores of 0.643 and 0.618, respectively. This indicates that the training dataset was sufficiently varied for the models to detect insects in new environments. Camera site S1-1 with red clover had a lower F1-score than the other sites with the same plant (S3-0, S4-0, and S4-1). This could be related to foreground defocus caused by the close camera distance to the plants. Camera sites S3-0, S4-0, and S4-1 had the best recall, precision, and F1-score. This is probably due to the high insect ratio of 4.6–15.3% and because red clover plants were heavily represented in the training dataset.
The box plot of the F1-scores shown in Figure 5 indicates an increased F1-score with motion-trained models. It also shows a lower variation in the ability to detect insects between the seven different test sites, indicating a more robust detector.
Figure 4 shows the abundance of insects detected with two YOLOv5 models trained on color and motion-enhanced images over the two months of the experiment, including images from the training, validation, and test datasets. False insect detections were typically found at the same spatial position in the image. A honeybee visit within the camera view typically lasts less than 120 s, as documented in [28]. A filter was therefore used to remove detections at the same spatial position within two minutes in the time-lapse image sequence.
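A sketch of such a filter, assuming each detection carries a timestamp and a box center, and that "same spatial position" is implemented as a small pixel distance (the 20-pixel tolerance is our assumption; the 120 s window is from [28]):

```python
from math import dist

def filter_repeated_detections(detections, window_s=120, max_offset_px=20):
    """Suppress detections reappearing at (almost) the same position
    within a two-minute window of the time-lapse sequence.

    `detections` is a list of (timestamp_s, (cx, cy)) tuples sorted
    by time. Detections that persist at a fixed position are
    typically false positives, since a honeybee visit lasts less
    than 120 s [28].
    """
    kept = []
    for t, center in detections:
        is_repeat = any(t - t0 <= window_s and dist(center, c0) <= max_offset_px
                        for t0, c0 in kept)
        if not is_repeat:
            kept.append((t, center))
    return kept
```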
Figure 4a shows the abundance for a YOLOv5 model trained with color images. There are periods with a large difference between the filtered and non-filtered detections, probably due to a high number of false insect detections.
Figure 4b shows the abundance for a YOLOv5 model trained with motion-enhanced images. This model in general showed a higher number of detections than the model trained with color images, indicating that more insects were found and detected. A visual overview of the results, showing the micro-average F1-score for six different sites, is given in Figure 6. Here, it is evident that MIE improves the ability to detect small insects across a variety of background plants, camera views, and distances. It can also be seen that the overall F1-score increases with a higher ratio of insects. Models trained with MIE are especially better at detecting insects at sites with sparse insects (Rocket top, 1.2%) and with plants out of focus close to the camera (Clover top, 2.3%).
Challenges in Camera Monitoring
Automated camera monitoring of insects visiting flowering plants is a particularly exciting prospect for non-invasive monitoring of insects and other small organisms in their natural environment. Compared with traditional manual sampling methods, however, camera recording also has challenges. Many cameras are required to monitor a large area, which produces an immense number of images for offline processing. Many remote natural locations where the system needs to operate without human intervention have no power or mobile network coverage, such as the sites in East Greenland where we installed and operated time-lapse cameras [60].
For insect pollinators, flowering plants are used to attract the insects; however, this requires adjusting the camera position during the flowering season to record a high abundance of insects, and this approach is difficult to standardize. Here, the challenge was to ensure that insects were visible in the camera view during monitoring. Cameras were moved to different viewing positions to keep blooming flowers in view during the monitoring period of our experiment. This is probably the most important limitation of automated camera insect monitoring for ensuring a high number of insect detections.
Calibration can also be a challenge when the camera is moved and plants grow during the monitoring period. In our experiment, we used autofocus, which often focuses on the vegetation in the background rather than on the flowers close to the camera, which the insects frequently visit. In [56], manual focus was used to focus on flowering Sedum plants in the foreground. However, this only works well with vegetation that has a near-constant height during the monitoring period.
Due to the small size of the objects of interest, deep learning models will often falsely identify elements in the complex background of plants as the object of interest. This is a challenge for processing, and especially for monitoring areas with a low abundance of insects, since the signal-to-noise ratio, in terms of true positive (TP) relative to false positive (FP) detections, will be very low. However, we have addressed this challenge in our proposed method, which is able to increase both the recall and precision of insect detection.