Airborne Optical Sectioning for Nesting Observation

We describe how a new and low-cost aerial scanning technique, airborne optical sectioning (AOS), can support ornithologists in nesting observation. After capturing thermal and color images during a seven-minute drone flight over a 40 m × 12 m patch of the nesting site of Austria's largest heron population, a total of 65 herons and 27 nests could be identified, classified, and localized in a sparse 3D reconstruction of the forest. AOS is a synthetic-aperture imaging technique that removes occlusion caused by leaves and branches. It registers recorded images to a common 3D coordinate system to support the reconstruction and analysis of the entire forest volume, which is impossible with conventional 2D or 3D imaging techniques. The recorded data are published with open access.


Table S1. Results of the ornithologists' analysis of findspots. Columns: ID (as in Fig. 4 of the main paper); bird (count and species); altitude (above ground level, AGL); description; views (visibility) and 1D scan (visibility), i.e., for the 2D and 1D synthetic apertures, the maximum number of recorded images in which the findspot was within the camera's field of view, and the percentage of recorded images in which the findspot was visible (not occluded).

Lateral and Axial Resolutions
The lateral sampling resolution f_l (number of samples per unit area) at the focal plane can be determined by back-projecting the focal plane onto the imaging plane of the drone's camera (i.e., the synthetic aperture plane)³. Assuming a pinhole camera model, parallel focal and image planes, and a square sensor, this leads to

f_l(d) = (n / l(d))²,  (1)

where n is the camera's spatial resolution (number of pixels on one axis of a square image sensor), d the distance between the focal plane and the synthetic aperture plane, l(d) = 2d tan(FOV/2) the side length of the area in view on the focal plane, and FOV the camera's field of view³.

In the presence of a pose-estimation error, features are misaligned during image integration. This results in an increased and blurred feature radius, as shown in Fig. S1a. Mathematically, this can be expressed as a box function (the feature) convolved with another box function (the pose uncertainty), which leads to a triangular or truncated-triangular profile of the feature. For measuring the sampling resolution, we apply the full width at half maximum (FWHM) metric. This metric is commonly used for measuring optical resolution limits (e.g., in microscopy). Furthermore, the FWHM is well defined for such a profile: it equals the maximum width of the two box functions, i.e., of the feature size l(d)/n and the pose uncertainty. Thus, equation (1) extends to

f_l(d) = 1 / max(l(d)/n, e(d))²,  (2)

where e(d) is the average pose-estimation error on the focal plane at distance d³.
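Equations (1) and (2) can be sketched as follows; the camera parameters in the example call are illustrative placeholders, not the actual specifications of our drone's cameras:

```python
import math

def lateral_resolution(n, fov_deg, d, e_px=0.0):
    """Lateral sampling resolution f_l (samples per unit area) at distance d.

    Implements equations (1) and (2): the effective feature width is the
    maximum of the back-projected pixel footprint l(d)/n and the
    pose-estimation error e(d), given here in pixels (e_px).
    """
    l = 2.0 * d * math.tan(math.radians(fov_deg) / 2.0)  # side length in view
    pixel_footprint = l / n          # metric size of one back-projected pixel
    e = e_px * pixel_footprint       # pose error converted to metric units
    w = max(pixel_footprint, e)      # FWHM of the convolved box profiles
    return 1.0 / w**2

# Example (hypothetical camera): n = 1000 px, FOV = 90 deg, d = 10 m
# lateral_resolution(1000, 90.0, 10.0)  -> samples per square meter
```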
The axial resolution defines the minimal distance between two focal planes that causes a distinguishable difference in the resulting images after integration. As for a pose-estimation error, the image integration can be considered a convolution of two box functions (i.e., the feature and the synthetic aperture projected onto the focal plane, as shown in Fig. S1b).
Thus, by applying the FWHM metric again, the axial resolution f_a(d) (the number of focal slices per unit distance) equals

f_a(d) = a(d) / (d · max(l(d)/n, e(d))),  (3)

where the synthetic aperture size a(d) equals the size of the synthetic aperture plane projected onto the focal plane.

For our field experiment, the lateral resolutions range from 51 823 to 3 224 (RGB) and 3 890 to 242 (thermal) samples per square meter, and the axial resolutions from 237 to 59 (RGB) and 59 to 15 (thermal) slices per meter. Note that the resolution of the RGB camera is limited by its pose-estimation error of 0.548 px, as the error exceeds half the size of a pixel. For the thermal camera, the pose-estimation error (0.137 px) does not degrade the resolution, as it is below half a pixel.

Figure S2 plots the lateral and axial resolutions of the RGB and thermal cameras for our field experiment using equations (2) and (3). Both resolutions vary with altitude (i.e., with the distance d between the synthetic aperture plane and the focal plane); we therefore plot them as a function of altitude above ground level (AGL). Both resolutions decrease with increasing distance d (i.e., with lower altitude AGL).

Figure S2. Lateral and axial resolutions computed with equations (2) and (3) for the focal stack shown in Fig. 3. Note that the vertical axes of these plots have logarithmic scaling.

Figure S3 and Table S1 compare AOS results of a one-dimensional (40 m) synthetic aperture to the results achieved with the full two-dimensional (40 m × 12 m) synthetic aperture actually measured in our field experiment. For this, we chose only a subset of sampling positions (the center scan line from waypoint 7 to waypoint 8 in Fig. 1c of the main paper).
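A minimal sketch of equation (3); the projected synthetic aperture size a(d) is passed in as a given input, and the numeric values in the example are illustrative, not our field-experiment parameters:

```python
import math

def axial_resolution(n, fov_deg, d, a_proj, e_px=0.0):
    """Axial resolution f_a (focal slices per unit distance) at distance d.

    Implements equation (3): a_proj is the synthetic aperture size projected
    onto the focal plane (assumed given here), and the effective feature
    width w is the same max() term as in equation (2).
    """
    l = 2.0 * d * math.tan(math.radians(fov_deg) / 2.0)  # side length in view
    w = max(1.0, e_px) * l / n   # effective feature width (eq. (2) term)
    return a_proj / (d * w)

# Example (hypothetical values): n = 1000 px, FOV = 90 deg, d = 10 m,
# projected aperture size a_proj = 10 m
# axial_resolution(1000, 90.0, 10.0, 10.0)  -> focal slices per meter
```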

One-dimensional vs. Two-Dimensional Synthetic Aperture
The 2D aperture was recorded in a 2-m-spaced grid-like pattern, line by line, within 7 min, resulting in a total of 130 image pairs. Thus, findspots might be in the field of view of multiple scan lines, which increases their recording interval (the number of views in which a findspot is within the field of view). In our field experiment, 3 min 15 s were required on average to fully record all views of a single findspot (on average, 21.5 views over 4.3 out of 7 scan lines). A recording time of over 3 min, however, causes significant motion blur if birds move during this time.
The 1D aperture, in contrast, samples only along a single scan line of approx. 40 m. This shortens the total flight time to 55 s for recording 20 image pairs, and reduces the recording interval to 4.8 views and the recording time to 9 s per findspot on average. While the shorter recording time significantly reduces motion blur for moving targets, the lower number of views does not lead to significantly worse results: since the total field of view of the 1D aperture is smaller than that of the 2D aperture, only 50 out of the 67 findspots were actually captured (the remaining ones were entirely out of the field of view at all sampling positions). But only 3 findspots that were detectable with the 2D aperture could not be identified with the 1D aperture due to occlusion (IDs 53, 58, and 61 in Table S1). Furthermore, 9 findspots are visible in only one view (IDs 14, 17, 29, 33, 41, 43, 45, 48, and 57 in Table S1), which makes it harder to reliably localize them in 3D, as at least two views are required for triangulation. In most cases, however, nearby features can be used to estimate the altitude.
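The two-view requirement can be illustrated with a minimal least-squares triangulation of a findspot from two viewing rays (a NumPy-based sketch; the ray origins and directions below are hypothetical values, not measured camera poses):

```python
import numpy as np

def triangulate(o1, d1, o2, d2):
    """Closest point, in the least-squares sense, between two viewing rays,
    each given by an origin o and a unit direction d. Two such views are the
    minimum needed to localize a findspot in 3D."""
    o1, d1, o2, d2 = (np.asarray(v, dtype=float) for v in (o1, d1, o2, d2))
    # Solve for ray parameters t1, t2 minimizing |(o1 + t1*d1) - (o2 + t2*d2)|^2
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    b = np.array([d1 @ (o2 - o1), d2 @ (o2 - o1)])
    t1, t2 = np.linalg.solve(A, b)
    # Return the midpoint between the two closest points on the rays
    return 0.5 * ((o1 + t1 * d1) + (o2 + t2 * d2))
```

With a single view, only the ray is known and the findspot's position along it (and hence its altitude) remains ambiguous, which is why nearby triangulated features are needed to estimate it.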
For a quantitative comparison, we calculated the mean squared error (MSE) between the thermal and RGB focal stacks computed for the 2D and for the 1D synthetic apertures over the altitude range of 0 m to 26 m above ground level. For normalized RGB and thermal values, it was 0.0021 and 0.0020, respectively.
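This comparison metric is a plain per-element MSE over normalized focal stacks; a sketch (the array arguments are placeholders for the actual RGB and thermal stacks):

```python
import numpy as np

def stack_mse(stack_a, stack_b):
    """Mean squared error between two focal stacks of equal shape whose
    values are normalized to [0, 1] (e.g., the 2D- and 1D-aperture stacks)."""
    a = np.asarray(stack_a, dtype=np.float64)
    b = np.asarray(stack_b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))
```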
With respect to Fig. S3, the difference between the 1D and 2D apertures is higher when fewer views are available for image integration. Since fewer views overlap, artifacts and differences are most noticeable at high altitudes (i.e., close to the synthetic aperture plane). Focal stack slices close to the ground exhibit less noticeable differences, due to the high number of available views. Furthermore, since the bokeh of defocused features corresponds to the shape of the synthetic aperture (scaled proportionally to the amount of defocus), it changes from a 2D box-shaped to a 1D line-shaped bokeh.
For the 1D aperture, parallax is recorded only in one direction; thus, equation (3) can be applied only along that axis. The lateral and axial resolutions of AOS (equations (2) and (3)), however, remain the same.
All of this suggests that single linear scanning paths with an adequately high number of samples are beneficial for reducing motion blur in the case of quickly moving targets. It has been statistically shown that the visibility gain AOS can achieve depends on the density of the occluder volume (i.e., the forest) and the number of recorded samples, but is independent of the aperture shape⁴. Only a (density-dependent) minimum spacing of sampling views has to be guaranteed. Increasing the number of recorded samples does not increase the scanning time significantly, as image capturing is literally done on the fly. The scanning path only has to be extended to 2D for capturing a larger area.

Figures S4 and S5 illustrate the unprocessed 8 bit RGB and color-coded 14 bit thermal images as captured by the drone during the field experiment. The original 130 recorded image pairs are available online⁵, together with the pose-estimation data, which can be used in various software tools (e.g., COLMAP or MeshLab), and the computed focal stacks of Fig. 3.

Figure S3. Comparison between the two-dimensional (40 m × 12 m, 130 image pairs) and the one-dimensional (40 m, 20 image pairs) synthetic aperture: RGB (a-f) and thermal (g-l) focal stack slices at the same altitudes (AGL), with the 2D aperture (a,c,e,g,i,k) and the 1D aperture (b,d,f,h,j,l). The mean squared error (MSE) indicates the quantitative difference between corresponding 1D and 2D focal stack slices (cf. Fig. 1c of the main paper).