Improving the precision and accuracy of animal population estimates with aerial image object detection

Animal population sizes are often estimated using aerial sample counts by human observers, both for wildlife and livestock. The associated counting methods have remained more or less the same since the 1970s, but suffer from low precision and low accuracy of population estimates. Aerial counts using cost‐efficient Unmanned Aerial Vehicles or microlight aircraft with cameras and an automated animal detection algorithm can potentially improve this precision and accuracy. Therefore, we evaluated the performance of the multi‐class convolutional neural network RetinaNet in detecting elephants, giraffes and zebras in aerial images from two Kenyan animal counts. The algorithm detected 95% of the number of elephants, 91% of giraffes and 90% of zebras that were found by four layers of human annotation, of which it correctly detected an extra 2.8% of elephants, 3.8% of giraffes and 4.0% of zebras that were missed by all humans, while detecting only 1.6 to 5.0 false positives per true positive. Furthermore, the animal detections by the algorithm were less sensitive to the sighting distance than those by humans. With such a high recall and precision, we posit it is feasible to replace manual aerial animal count methods (from images and/or directly) by only the manual identification of image bounding boxes selected by the algorithm, and then to use a correction factor equal to the inverse of the undercounting bias in the calculation of the population estimates. This correction factor causes the standard error of the population estimate to increase slightly compared to a manual method, but this increase can be compensated for by increasing the sampling effort by 23%. However, an increase in sampling effort of 160% to 1,050% can be attained with the same expenses for equipment and personnel using our proposed semi‐automatic method compared to a manual method.
Therefore, we conclude that our proposed aerial count method will improve the accuracy of population estimates and will decrease the standard error of population estimates by 31% to 67%. Most importantly, this animal detection algorithm has the potential to outperform humans in detecting animals from the air when supplied with images taken at a fixed rate.

Accurate animal population estimates are important for farmers to determine the value of their companies and for game managers to optimize hunting strategies (Hearne, Korrûbel, & Koch, 2000;Mwakiwa et al., 2016;Van Lavieren, 1982), while precise estimates are important to follow population trends of rare and/or valuable wildlife species (Van Lavieren, 1982). Most of the animal population estimates are based on sample counts, where animals are only counted in a part of the area and the counts are afterwards extrapolated, which is cheaper and less time-consuming than total counts (Jachmann, 2001;Norton-Griffiths, 1978;Van Lavieren, 1982).
Very often aerial counts are performed, which are the only practical way to estimate the population sizes of some animal species (Davis & Winstead, 1980;Jachmann, 2001;Norton-Griffiths, 1978;Van Lavieren, 1982). Although aerial counts are recommended for open landscapes without too much vegetation or terrain features blocking the view of the observers, many animal counts in forested areas are also performed from the air (LeResche & Rausch, 1974;Van Lavieren, 1982). The present-day method of estimating animal populations from aerial counts has not changed substantially since the 1970s (Jachmann, 2001;Nichols et al., 1996;Redfern, Viljoen, Kruger, & Getz, 2002), and suffers from imprecise and inaccurate population estimates (Davis & Winstead, 1980;Fleming & Tracey, 2008;Jachmann, 2001;LeResche & Rausch, 1974;Rabe, Rosenstock, & DeVos, 2002;Stott & Olson, 1972;Van Lavieren, 1982). Low sampling efforts, e.g. sampled kilometres, are more rule than exception in aerial counts due to the high associated costs (De Bie & Kessler, 1983;DHV Consulting Engineers, 1980;Norton-Griffiths, 1978;Redfern et al., 2002). Decreasing the costs of aerial counts could lead to an increase in sampling effort and an increase in the precision and accuracy of population estimates (Davis & Winstead, 1980;Jolly, 1969a;Van Lavieren, 1982).
Achieving accurate and precise population estimates from aerial sample counts is notoriously difficult. When comparing aerial counts to the actual population size (often estimated using multiple and/or thorough ground counts) in environments that are not completely open, aerial counts generally underestimate the number of animals by 8% to 80% (Davis & Winstead, 1980; De Bie & Kessler, 1983; Dunn, Donnelly, & Krausmann, 2002; Fleming & Tracey, 2008; LeResche & Rausch, 1974; Stott & Olson, 1972). This underestimation bias in aerial sample counts can be corrected by multiplying population estimates by a correction factor per species per stratum, viz., vegetation or terrain type, often based on a comparison of a subsample with thorough ground counts or a double-observer approach (Cook & Jacobson, 1979; Jachmann, 2002; Jolly, 1969b; Nichols et al., 2000; Norton-Griffiths, 1978; Van Lavieren, 1982). More elaborate methodological designs, i.e. distance sampling methods, make these correction factors also dependent on the animal sighting distance (Buckland et al., 2004). The correction factors that are used can be extremely large: even for very open savanna ecosystems, factors as large as 13 have been reported (De Bie & Kessler, 1983). Such large factors have a large impact on the precision of population estimates (Jolly, 1969b).
Sampling effort should ideally be based on preliminary surveys, but is in practice often dictated by logistics and financial considerations (Van Lavieren, 1982). Consequently, most population estimates from aerial sample counts have such low precision that only large changes of the population can be detected (Davis & Winstead, 1980). The underestimation bias depends not only on the animal species and environmental characteristics of a certain stratum, but also on sampling effort and human skill related factors (Caughley, Sinclair, & Scott-Kemmis, 1976; Van Lavieren, 1982). Increasing the sampling effort by flying slower, lower, with narrower transects and/or more transects, should result in an increase in accuracy and precision (Jolly, 1969a, 1969b; Norton-Griffiths, 1978; Van Lavieren, 1982). On the other hand, having a method that would not depend so much on the skill, fitness, and visual responses of different observers and pilots could also increase the accuracy and precision of population estimates (Christie, Gilbert, Brown, Hatfield, & Hanson, 2016; Norton-Griffiths, 1978; Sirmacek et al., 2012; Van Lavieren, 1982).
An increase in sampling effort can most easily be obtained by an increase in sampling efficiency, which can be realized by a decrease in financial costs per sampling unit (Norton-Griffiths, 1978).
To simultaneously increase the sampling efficiency and standardize the animal detection system, Unmanned Aerial Vehicles (UAVs) or microlight aircraft with cameras and an automated image object detection algorithm are considered an alternative to manned aircraft with human observers (Colefax, Butcher, & Kelaher, 2018; Linchant, Lisein, Semeki, Lejeune, & Vermeulen, 2015; Rey, Volpi, Joost, & Tuia, 2017; Sirmacek et al., 2012). In the past decade, this proposed method has been explored and tested in various studies (Hodgson et Van Gemert et al., 2014). At first, this method seemed too far-fetched to apply in practice (Van Gemert et al., 2014), but lately, due to developments in deep learning image recognition, the performance of this method has increased enough to accurately and fully automatically count animals in homogeneous and open landscapes (Andrew et al., 2017; Chamoso et al., 2014; Hodgson et al., 2018), or semi-automatically in more challenging landscapes (Kellenberger et al., 2018; Rey et al., 2017). At present, semi-automatic aerial counts with UAVs or microlights and an image object detection algorithm supplemented by human verification of the algorithm's output could be a feasible alternative to manual aerial counts (from images and/or directly) in any area where these manual aerial counts are appropriate (Kellenberger et al., 2018; Rey et al., 2017). The current challenges for this semi-automatic method are the small animals (in terms of pixel dimensions) in large images and the algorithms' ability to distinguish between species. However, the recent rapid development of Convolutional Neural Networks (CNNs) in image object detection has the potential to solve these issues (He, Zhang, Ren, & Sun, 2016; Lin et al., 2017; Lin, Goyal, Girshick, He, & Dollár, 2018).
Replacing manual aerial sample counts by the proposed semi-automatic method thus has the potential to improve the accuracy and precision of animal population estimates due to an increased sampling efficiency and a standardized animal detection system.
In this study we evaluate the performance of a fully convolutional multi-class one-stage detector, called RetinaNet (Lin et al., 2018), to detect elephants Loxodonta africana, giraffes Giraffa camelopardalis and plains zebras Equus quagga in aerial images from two savanna animal counts in Kenya, 2014 and 2015. The performance of this algorithm is then used to compute the difference in sampling effort that is required to compensate for the gain or loss of precision due to the over- or undercounting bias of the algorithm compared to human counts. Moreover, we compare the accuracy of population estimates using manual versus semi-automatic aerial counts by comparing the human and algorithm detections of animals versus the distance to the aircraft, because most factors that cause animal counts to be biased are dependent on the sighting distance (Van Lavieren, 1982). Furthermore, we evaluate the costs of manual versus semi-automatic aerial counts and determine how much the sampling efficiency will improve when manual aerial counts are replaced with semi-automatic counts for the same total expenditure. Finally, we combine both the cost analysis and the animal detection algorithm to compute how much the precision of population estimates can improve when manual aerial counts are replaced with semi-automatic aerial counts.

| Detection algorithm
We used the fully convolutional multi-class one-stage detector RetinaNet to detect animals in the aerial images (Lin et al., 2018).
The architecture of RetinaNet is composed of a backbone network and two task-specific subnetworks (Figure 2). This algorithm takes entire images as input, constructs feature maps at different scale levels, generates region proposals at each scale level by means of 'anchors', and performs classification and bounding box regression for each anchor to predict the presence and location of objects in the input images. By using the Focal Loss function, training focusses on cases that are hard to classify, leading to high detection accuracy.
For details about RetinaNet, we refer to the study that proposed this algorithm (Lin et al., 2018).
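To illustrate how the Focal Loss shifts training towards hard cases, the following is a minimal sketch of the binary form given by Lin et al., FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The defaults γ = 2 and α = 0.25 are the values reported in that paper; whether this study used the same settings is not stated here, so they are an assumption, as is the function name.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single anchor (illustrative sketch).

    p: predicted probability of the object class, 0 < p < 1
    y: ground-truth label, 1 for object and 0 for background
    gamma, alpha: assumed defaults taken from Lin et al., not from this study
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified anchors.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)  # confident and correct: almost no loss
hard = focal_loss(0.10, 1)  # confident and wrong: dominates the total loss
print(f"easy example: {easy:.6f}, hard example: {hard:.6f}")
```

With γ = 0 the expression reduces to an α-weighted cross-entropy, which is a quick way to check the implementation.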
The algorithm was trained for 50 epochs with a batch size of 1 image and again for 50 epochs with a batch size of 2 images, which took on average 7 min per epoch for a batch size of 1 and 5 min for a batch size of 2. Batch sizes up to 7 were possible within the memory constraints of the system, but to limit the total computation time we stopped training at the first batch size that performed worse than the next smaller batch size.
FIGURE 1 Example of an image containing giraffe and zebra. Top: entire image; Middle: part of image zoomed in, with annotated bounding boxes (red for giraffe and blue for zebra). Bottom: annotated animals from left to right
The 561 images were randomly divided into a training (70%), validation (10%) and test (20%) set (Table 1). The images could not be used in their entirety as training input for the algorithm due to memory limitations. Therefore, the training images were first divided into 42 (7 × 6) equal-sized 'tiles', each with 200 pixels of overlap on all sides, resulting in tiles of on average 900 × 700 pixels. The overlap between tiles sometimes resulted in the same animal being present in several tiles, but ensured that every annotated animal was fully present in at least one tile. Sometimes animals were only partly covered in a tile; in that case we cut off a part of the tile to remove as much of the partly covered animal as possible without removing the fully covered animals in the tile. We only trained the algorithm with the annotations of animals whose bounding box was entirely covered in the tile, and all partly covered animals were considered to be background. All animals of species other than our three focal species were also considered to be background. Furthermore, all training tiles were horizontally mirrored to be used as extra training data. We only used the tiles that contained at least one fully covered animal as training input for the algorithm, which amounted to approximately 10% of all the generated tiles. As the same example of an individual animal was sometimes present in several overlapping training tiles and because the total number of training tiles was
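The tiling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the image size of 4000 × 3000 pixels, the function names and the way grid cells are extended by the overlap are all assumptions, and the extra cropping of partly covered animals is left out.

```python
def tile_boxes(width, height, cols=7, rows=6, overlap=200):
    """Return (left, top, right, bottom) pixel boxes of overlapping tiles."""
    boxes = []
    for row in range(rows):
        for col in range(cols):
            # Edges of the non-overlapping grid cell.
            left, right = col * width // cols, (col + 1) * width // cols
            top, bottom = row * height // rows, (row + 1) * height // rows
            # Extend the cell by the overlap on every side, clipped to the image.
            boxes.append((max(0, left - overlap), max(0, top - overlap),
                          min(width, right + overlap), min(height, bottom + overlap)))
    return boxes


def fully_inside(box, tile):
    """True if an annotated bounding box lies entirely within a tile."""
    return (box[0] >= tile[0] and box[1] >= tile[1]
            and box[2] <= tile[2] and box[3] <= tile[3])


def training_tiles(width, height, annotations):
    """Keep only tiles with at least one fully covered annotation."""
    kept = []
    for tile in tile_boxes(width, height):
        inside = [a for a in annotations if fully_inside(a, tile)]
        if inside:  # tiles without a fully covered animal are discarded
            kept.append((tile, inside))
    return kept


# Hypothetical image size and one annotation that straddles a grid edge.
tiles = tile_boxes(4000, 3000)
kept = training_tiles(4000, 3000, [(560, 490, 600, 520)])
print(len(tiles), "tiles generated,", len(kept), "kept for training")
```

Because every cell is extended on all sides, an animal that straddles a grid edge is still fully contained in at least one tile, provided its bounding box is not larger than the overlap.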

| Algorithm evaluation
After each epoch during training, the resulting algorithm was saved and evaluated on a validation set. Detection was done on whole images and took on average 1.5 s per image, during which the images simply had to be forwarded through the trained algorithm. The algorithm that performed the best on the validation set was eventually evaluated on a test set, which also consisted of whole images.
The annotated bounding boxes were compared with the predicted bounding boxes per species using the Intersection over Union (IoU) measure, defined as the area of overlap divided by the area of union. A combination of an annotated box and a predicted box was potentially considered a True Positive (TP) when the IoU was equal to or larger than a certain threshold. However, every annotation and every prediction could only be used once to generate a TP, where priority was given to the combination with the largest IoU. Every prediction that was not coupled with an annotation after this process was considered a potential False Positive (FP) and every annotation that was not coupled with a prediction a False Negative (FN). Finally, we checked if the FPs were not actually TPs that were missed by the four layers of human annotation: (1) taking a photograph from the air upon spotting a group of animals; (2) counting the individual animals   be present in a larger area. Therefore, the animal detection method that is more accurate and thus less influenced by the distance to the aircraft, should have a detection curve that is more similar to the geometric expectation.
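The IoU-based matching of annotations and predictions described in this section can be sketched as below. This is an illustrative implementation under stated assumptions: boxes are (left, top, right, bottom) tuples, the threshold of 0.5 is a placeholder because the text only mentions "a certain threshold", and the function names are hypothetical. The matching would be run separately for each species.

```python
def iou(a, b):
    """Intersection over Union of two (left, top, right, bottom) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(annotations, predictions, threshold=0.5):
    """Greedy one-to-one matching, largest IoU first.

    Returns (tp, fp, fn): tp is a list of (annotation index, prediction index)
    pairs, fp a list of unmatched prediction indices and fn a list of
    unmatched annotation indices.
    """
    candidates = sorted(
        ((iou(a, p), i, j)
         for i, a in enumerate(annotations)
         for j, p in enumerate(predictions)),
        reverse=True,
    )
    used_annotations, used_predictions, tp = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # all remaining pairs overlap too little
        if i in used_annotations or j in used_predictions:
            continue  # each box may contribute to only one TP
        used_annotations.add(i)
        used_predictions.add(j)
        tp.append((i, j))
    fp = [j for j in range(len(predictions)) if j not in used_predictions]
    fn = [i for i in range(len(annotations)) if i not in used_annotations]
    return tp, fp, fn


annotations = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(1, 1, 11, 11), (0, 0, 10, 10), (50, 50, 60, 60)]
tp, fp, fn = match_boxes(annotations, predictions)
print(tp, fp, fn)
```

In this toy example two predictions overlap the first annotation; only the one with the larger IoU becomes a TP, and the other is left as a potential FP for manual checking.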

| Population estimate precision
The over- or undercounting bias of the algorithm compared to the manually annotated animals requires an extra correction factor for the population estimate on top of the regularly used correction factor for manual counts. The extrapolation of population sizes based on insight gathered in previously undertaken surveys, for example using aerial/ground count comparisons, distance sampling methods and/or double-observer approaches, can thus still be used but should now be supplemented with this extra correction factor. This counting bias can be computed by dividing the algorithm's recall by the humans' recall, with the inverse of this counting bias to be used as the correction factor. This should be done per species and ideally per stratum, viz., vegetation or terrain type, as well. This correction factor can be larger or smaller than one, depending on whether the algorithm counts fewer or more animals than humans did. When the population estimate is multiplied by a constant correction factor, the standard error of the population estimate changes by the same factor. To compensate for the change in standard error of the estimate, a change in sampling effort of the area is needed. Equation 3 summarizes how much the sampling effort needs to change to achieve a standard error of the (semi-)automatic method that is equal to that of the manual method (see Supporting Information):
$$N = \frac{n_{\text{automatic}}}{n_{\text{manual}}} = r^{-2} \qquad (3)$$
where N is the index detailing how much the number of sampling units needs to change, and r is the counting bias. Equation 4 gives the resulting change in standard error when the sampling effort increases:
$$S = \frac{1}{r\sqrt{k}} \qquad (4)$$
where S is the index detailing what the standard error of the population estimate from semi-automatic counts will be compared to manual counts by spending the same amount of finances on the semi-automatic method as on the manual method, and k is the increase in sampling units.
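The reasoning behind these two indices can be written out as a short derivation. It is a sketch under two assumptions that are ours and are not spelled out in this section: the standard error is proportional to the inverse square root of the number of sampling units, and before correction both methods have the same standard error at equal sampling effort. The symbol σ is a hypothetical between-unit standard deviation introduced only for illustration; the actual derivation is in the Supporting Information.

```latex
% Assumptions (illustrative): SE proportional to n^{-1/2}; equal uncorrected SE at equal n.
% The correction factor 1/r scales the standard error of the semi-automatic estimate.
\begin{align*}
SE_{\text{manual}} &= \frac{\sigma}{\sqrt{n_{\text{manual}}}} \\
SE_{\text{automatic}} &= \frac{1}{r} \cdot \frac{\sigma}{\sqrt{n_{\text{automatic}}}} \\
SE_{\text{automatic}} = SE_{\text{manual}} \;&\Rightarrow\; N = \frac{n_{\text{automatic}}}{n_{\text{manual}}} = r^{-2} \\
n_{\text{automatic}} = k \, n_{\text{manual}} \;&\Rightarrow\; S = \frac{SE_{\text{automatic}}}{SE_{\text{manual}}} = \frac{1}{r\sqrt{k}}
\end{align*}
```

For r < 1 (undercounting) the correction factor is larger than one, so N > 1: more sampling units are needed to keep the standard error unchanged.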

| Algorithm evaluation
The algorithms trained with a batch size of 1 performed overall better than the algorithms with a batch size of 2 ( to the training images, this algorithm achieved a mAP of 0.86, with an AP of 0.87 on elephant, 0.92 on giraffe and 0.80 on zebra (Figure S2). This suggests that the training set size is acceptable, but possibly the detection of giraffe (the least frequently occurring species in our dataset) would improve with more training data (Table 1) (Figure 4). F1-scores are good performance indicators when the algorithm is considered to be used in a fully automatic animal detection system. The predicted bounding boxes with a score above the mean of the score thresholds corresponding to the maximum F1-scores detect many of the annotated animals, but miss the inconspicuous ones to prevent predicting many FPs (Figure 5). There was little confusion between species at this score threshold: for elephant 4% of the predictions were on other species, for giraffe 5% and for zebra 1%.
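As an illustration of how a score threshold corresponding to the maximum F1-score can be found for one species, consider the sketch below. It assumes each prediction has already been labelled as TP or FP by the matching step and that this label does not change with the threshold, which is a simplification; the function names and the toy data are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from TP, FP and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_f1_threshold(scored, n_annotations):
    """Return (max F1, score threshold) over all observed scores.

    scored: list of (score, is_true_positive) pairs, one per prediction
    n_annotations: number of annotated animals of this species
    """
    best_f1, best_threshold = 0.0, None
    for threshold in sorted({score for score, _ in scored}):
        kept = [hit for score, hit in scored if score >= threshold]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = n_annotations - tp
        f1 = precision_recall_f1(tp, fp, fn)[2]
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_f1, best_threshold


# Toy predictions for one species with four annotated animals.
scored = [(0.9, True), (0.8, True), (0.6, False),
          (0.5, True), (0.3, False), (0.2, False)]
f1, threshold = best_f1_threshold(scored, n_annotations=4)
print(f"max F1 = {f1:.2f} at score threshold {threshold}")
```

Repeating this per species and averaging the resulting thresholds would give one common threshold of the kind referred to above.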
The top 17% of the FPs with the highest score confidences (361 bounding boxes, with a score confidence between 1 and 0.4) were visually checked to determine if they were not actually TPs that were missed by the four layers of human annotation. It turned out that 20 of these detected bounding boxes (4 elephants, 8 giraffes and 8 zebras) were actually correctly predicted, but missed during annotations (Figure 6a-c). Furthermore, 25 other detected bounding boxes (10 elephants, 6 giraffes and 9 zebras) could potentially also be TPs

| Population estimate precision
Using the minimum of the three aforementioned values of the algorithm's undercounting bias compared to human annotations (r = 0.9), it follows from Equation 3 that the increase (N) in sampling units that is needed to achieve a decrease in standard error of the population estimate (S_Y) equal to r is as follows:
$$N = \frac{n_{\text{automatic}}}{n_{\text{manual}}} = r^{-2} = 0.9^{-2} \approx 1.23.$$
The costs per sampled kilometre of various helicopter and fixed-wing manual count methods, using direct observations, manual image verification and a combination, were compared with the expected costs of our proposed semi-automatic method using UAVs and microlights with an animal detection algorithm (see Supporting Information), from which follows that the costs can be reduced by a factor 2.6 to 11.5. Using these factors (k = 2.6 to 11.5) and the undercounting bias (r = 0.9), it follows from Equation 4 that the factor by which the standard error of the population estimate will change (S) is as follows:
$$S = \frac{1}{r\sqrt{k}} = \frac{1}{0.9\sqrt{2.6}} \text{ to } \frac{1}{0.9\sqrt{11.5}} \approx 0.69 \text{ to } 0.33.$$
It is thus possible to sample 160% to 1,050% more units, e.g. kilometres, for the same costs, which will result in an increase of the accuracy of animal population estimates and an overall decrease of the standard error of 31% to 67%. Furthermore, this standard error will likely decrease even more in practice because UAVs or microlights carrying cameras with zoom lenses can probably fly higher than manned aircraft whilst still being able to count the animals.
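The arithmetic in this section can be checked with a few lines of code. The expression N = r^-2 is taken from the text; the form S = 1/(r√k) is an assumption, chosen because it reproduces the reported decrease in standard error of 31% to 67% for k = 2.6 to 11.5 and r = 0.9. The function names are hypothetical.

```python
import math


def sampling_units_factor(r):
    """N: factor by which sampling units must grow to offset a counting bias r."""
    return r ** -2


def standard_error_factor(r, k):
    """S: standard error relative to manual counts (assumed form 1 / (r * sqrt(k)))."""
    return 1.0 / (r * math.sqrt(k))


r = 0.9  # undercounting bias of the algorithm relative to human annotation
print(f"N = {sampling_units_factor(r):.2f}")
for k in (2.6, 11.5):  # cost reduction factors, i.e. increase in sampling units
    s = standard_error_factor(r, k)
    print(f"k = {k}: S = {s:.2f}, change in standard error = {100 * (s - 1):+.0f}%")
```

A cost reduction factor k corresponds to (k - 1) × 100% more sampling units, which gives the 160% and 1,050% mentioned above.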

| DISCUSSION
This will increase the sampling efficiency further. Finally, the animal detections by the algorithm were less sensitive to the sighting distance than the human detections were. This highlights the fact that
Our animal detection algorithm performed better than previously published algorithms for aerial imagery of mammals in similar habitats (Kellenberger et al., 2018; Rey et al., 2017; Sirmacek et al., 2012), whilst also being able to differentiate between animal species. Furthermore, this is likely the first evidence of an algorithm detecting animals that multiple layers of humans were not able to detect. Therefore, we posit that this algorithm is likely to outperform humans in the detection of animals from an aircraft when images are taken at a fixed interval, instead of only when animals are spotted by human observers.
As with many new technological developments, implementing this semi-automatic method requires some initial work and potentially schooling of personnel. An animal detection algorithm should first be trained for a new area and animal species and verified with manual count data to compute performance measures.
However, previously collected aerial image footage can potentially be used as input data for this task. Annotating the images by drawing bounding boxes around the animals is the most time-consuming part of this process, which took us on average two minutes per image per person. Training and validating the algorithm is a matter of hours when running the analysis on a dedicated server with pre-installed software. As with all deep learning applications, performance improves with more training data, which implies that rare species will be more difficult to distinguish from other species. In a semi-automatic approach this can be dealt with by merging, for example, all medium-sized antelopes into one class for the algorithm, with humans determining the species afterwards. Moreover, as cameras can be equipped with zoom lenses and can generate high-resolution images, the potential to count smaller animal species than is possible in manual aerial counts can be explored. Furthermore, some have argued that UAVs are less suited for animal counts than manned aircraft because of their smaller action radius (Christie et al., 2016). We concur that UAV flying can be restricted to the pilot's line of sight due to national legislation, which makes microlights more suitable for semi-automatic game counts in large, hilly and/or densely vegetated areas. However, because of the low costs of UAV equipment versus manned aircraft, it is possible for a single UAV pilot to transport multiple batteries, mobile battery chargers and/or UAVs for a single animal count and therefore still cover a large area per day.
Moreover, factors that influence animal visibility still impact the semi-automatic counts, just like they impact manual aerial animal counts. However, practices that can be employed to correct manual aerial counts for varying animal visibility, e.g. detection curves in distance sampling methods (probability of spotting an animal vs. distance to aircraft), sighting-probability models (probability of spotting an animal vs. various external factors), area division in strata and ground count comparisons (Buckland et al., 2004;Norton-Griffiths, 1978), can still be applied to semi-automatic counts. In this study we computed a single correction factor in order to derive the potential increase in population estimates' precision, but this should in practice be done per species per stratum and can be a function of sighting distance and external factors as well. With semi-automatic counts there is also the potential to estimate the fraction of animals that are missed by both humans and the algorithm together by using the double-observer approach during the model building phase (where one observer is now the algorithm), which could give extra information about the actual population sizes (Cook & Jacobson, 1979;Nichols et al., 2000).
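To make the double-observer idea concrete, the sketch below treats the humans and the algorithm as two observers and applies a simple Lincoln-Petersen style estimator. This estimator is swapped in here for illustration: it assumes both observers detect animals independently and with equal probability for every animal, and it is not necessarily the formulation used in the cited double-observer studies. The function name and the counts are hypothetical.

```python
def double_observer_estimate(both, only_human, only_algorithm):
    """Estimate the total number of animals and the fraction missed by both.

    both: animals detected by humans and by the algorithm
    only_human: animals detected by humans only
    only_algorithm: animals detected by the algorithm only
    """
    if both <= 0:
        raise ValueError("at least one animal must be detected by both observers")
    n_human = both + only_human
    n_algorithm = both + only_algorithm
    # Lincoln-Petersen style estimate, valid under independent detections.
    total = n_human * n_algorithm / both
    seen = both + only_human + only_algorithm
    return total, 1.0 - seen / total


# Hypothetical counts for one species in one stratum.
total, missed_fraction = double_observer_estimate(80, 10, 10)
print(f"estimated total: {total:.2f}, missed by both: {100 * missed_fraction:.1f}%")
```

If detections by humans and the algorithm are positively correlated, for example because both miss animals under dense canopy, this simple estimator will underestimate the fraction that is missed.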
Furthermore, as UAVs and microlights can carry on-board sensors that accurately monitor and record speed, altitude and tilt, and as the aerial imagery allows all kinds of landscape, terrain and vegetation characteristics to be recorded for every part of the count, it is now possible to further modernize aerial animal count practices.
All these external data can be used to create detection curves that are not fixed for a certain area or stratum, but dynamic over the whole count. Therefore, the computation of population estimates can potentially be done in a more continuous fashion without the need to choose a discrete set of strata.
Using our proposed semi-automatic aerial animal count method, instead of manual aerial counts, with the same total expenditure will result in a better accuracy and precision of animal population estimates. This semi-automatic method is influenced far less by factors such as animal group size, aircraft speed and observer fatigue, experience and skill, because it is far less dependent on human observations. This causes the counts of the semi-automatic method to be more consistent over a variety of conditions than manual counts. As the performance of image object detection algorithms and the action radius and autonomy of UAVs improve rapidly (Christie et al., 2016; Lin et al., 2018), we contend that the population estimates of this semi-automatic aerial animal count method will improve even further over time and that aerial animal counts can become fully automatic in the near future.
206-3e5a-4673-b830-b5c86 6445b8c (Eikelboom, 2019).